Python: downsampling a DataFrame by splitting observations proportionally
Tags: python, pandas, dataframe, resampling, pandas-resample
Given a DataFrame with a timestamp column (ts), I want to downsample it to hourly resolution. Values previously indexed by ts should now be split proportionally, according to the number of minutes that fall into each hour. [Note: when resampling, the data should be split proportionally across the rows that would otherwise be NaN.]
ts event duration
0 2020-09-09 21:01:00 a 12
1 2020-09-10 00:10:00 a 22
2 2020-09-10 01:31:00 a 130
3 2020-09-10 01:50:00 b 60
4 2020-09-10 01:51:00 b 50
5 2020-09-10 01:59:00 b 26
6 2020-09-10 02:01:00 c 72
7 2020-09-10 02:51:00 b 51
8 2020-09-10 03:01:00 b 63
9 2020-09-10 04:01:00 c 79
import pandas as pd

def create_dataframe():
    # Sample data: one row per observation, ts given as a string
    df = pd.DataFrame([{'duration':12, 'event':'a', 'ts':'2020-09-09 21:01:00'},
                       {'duration':22, 'event':'a', 'ts':'2020-09-10 00:10:00'},
                       {'duration':130, 'event':'a', 'ts':'2020-09-10 01:31:00'},
                       {'duration':60, 'event':'b', 'ts':'2020-09-10 01:50:00'},
                       {'duration':50, 'event':'b', 'ts':'2020-09-10 01:51:00'},
                       {'duration':26, 'event':'b', 'ts':'2020-09-10 01:59:00'},
                       {'duration':72, 'event':'c', 'ts':'2020-09-10 02:01:00'},
                       {'duration':51, 'event':'b', 'ts':'2020-09-10 02:51:00'},
                       {'duration':63, 'event':'b', 'ts':'2020-09-10 03:01:00'},
                       {'duration':79, 'event':'c', 'ts':'2020-09-10 04:01:00'},
                       {'duration':179, 'event':'c', 'ts':'2020-09-10 06:05:00'},
                       ])
    # Convert the ts strings to real timestamps
    df.ts = pd.to_datetime(df.ts)
    return df
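I want to estimate what was produced based on the ratio between the time spent and the time over which it was produced. This is comparable to counting completed lines of code and asking how many lines were actually completed in each hour.
For example: at "2020-09-10 00:10:00" we have 22. Over the period 21:01 - 00:10 we then split it as follows: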
59 min of 21:00 hours -> 7 => =ROUND(22/189*59,0)
60 min of 22:00 hours -> 7 => =ROUND(22/189*60,0)
60 min of 23:00 hours -> 7 => =ROUND(22/189*60,0)
10 min of 00:00 hours -> 1 => =ROUND(22/189*10,0)
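The arithmetic above can be checked with a few lines of Python. This is only a sketch of the proportional split for this single interval; the variable names are illustrative and not part of the original post.

```python
# Minutes of the 21:01 -> 00:10 interval that fall into each clock hour.
minutes_per_hour = {"21:00": 59, "22:00": 60, "23:00": 60, "00:00": 10}
total_minutes = sum(minutes_per_hour.values())  # 189 minutes in total
value = 22  # duration recorded at 2020-09-10 00:10:00

# Same formula as the spreadsheet: ROUND(22 / 189 * minutes, 0)
shares = {hour: round(value / total_minutes * m) for hour, m in minutes_per_hour.items()}
print(shares)  # {'21:00': 7, '22:00': 7, '23:00': 7, '00:00': 1}
```

In this case the rounded shares add up to the original 22 again, which is not guaranteed in general when each share is rounded on its own.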
The result should look like this:
ts event duration
0 2020-09-09 20:00:00 a NaN
1 2020-09-09 21:00:00 a 7
2 2020-09-09 22:00:00 a 7
3 2020-09-09 23:00:00 a 7
4 2020-09-10 00:00:00 a 1
5 2020-09-10 01:00:00 b ..
6 2020-09-10 02:01:00 c ..
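The problem with this approach:
In my view this approach has a serious problem. If you look at row [1] -> 2020-09-10 07:00:00, we have a value of 4 that needs to be divided between 3 hours. Taking 1 as the base duration value (the base unit), we get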
def create_dataframe2():
    # Two rows three hours apart, used to illustrate the rounding problem
    df = pd.DataFrame([{'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 07:00:00'},
                       {'duration':4, 'event':'c', 'c':'event3.5', 'ts':'2020-09-10 10:00:00'}])
    df.ts = pd.to_datetime(df.ts)
    return df
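Source
Expected output
I could not find a solution in pandas, so I created one in pure Python.
Basically, I sort all the values and iterate over them, passing two datetimes (start_time and end_time) to a function that does the processing.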
from datetime import timedelta

import pandas as pd

def get_ratio_per_hour(start_time: pd.Timestamp, end_time: pd.Timestamp, data_: int):
    # get total hours between the start and end, use this for looping
    totalhrs = lambda x: [1 for _ in range(int(x // 3600))
                          ] + [
                              (x % 3600 / 3600
                               or 0.1  # added for loop fix afterwards
                               )]
    # check if Start and End are not in same hour
    if start_time.hour != end_time.hour:
        seconds = (end_time - start_time).total_seconds()
        if seconds < 3600:
            parts_ = [1] + totalhrs(seconds)
        else:
            parts_ = totalhrs(seconds)
    else:
        # parts_ define the loop iterations
        parts_ = totalhrs((end_time - start_time).total_seconds())
    sum_of_hrs = sum(parts_)
    # for Constructing DF
    new_hours = []
    mins = []
    # Clone data
    start_time_ = start_time
    end_time_ = end_time
    for e in range(len(parts_)):
        # print(parts_[e])
        if sum_of_hrs != 0:
            if sum_of_hrs > 1:
                if end_time_.hour != start_time_.hour:
                    # Floor > based on the startTime +1 hour
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    #
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # Hour is same.
                    floor_time = (start_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(start_time_.floor('H'))
                    mins.append((floor_time - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
            else:
                if end_time_.hour != start_time_.hour:
                    # Get round off hour
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append(60 - ((floor_time - end_time_).total_seconds() // 60)
                                )
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
                else:
                    # Hour is same.
                    floor_time = (end_time_ + timedelta(hours=1)).floor('H')
                    new_hours.append(end_time_.floor('H'))
                    mins.append((end_time_ - start_time_).total_seconds() // 60)
                    sum_of_hrs = sum_of_hrs - 1
                    start_time_ = floor_time
    # Get DataFrame Build
    df_out = pd.DataFrame()
    df_out['hours'] = pd.Series(new_hours)
    df_out['mins'] = pd.Series(mins)
    df_out['ratios'] = round(data_ / sum(mins) * df_out['mins'])
    return df_out
Intermediate DataFrame:
Final output:
The first step is to add a "previous ts" column to the source DataFrame:
df['tsPrev'] = df.ts.shift()
Then set the ts column as the index:
df.set_index('ts', inplace=True)
The third step is to create an auxiliary index, composed of the original index and the "full hours":
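The snippet for this step is not present in this copy of the answer. One way to build such an index, offered as an assumption rather than as the answer's own code, is to take the union of a full-hour `date_range` with the original timestamps. The name `ind` and the three sample rows are illustrative.

```python
import pandas as pd

# Three rows of the sample data, already indexed by ts.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2020-09-09 21:01:00", "2020-09-10 00:10:00", "2020-09-10 01:31:00"]),
    "event": ["a", "a", "a"],
    "duration": [12, 22, 130],
}).set_index("ts")

# Every full hour from the hour of the first row to the hour of the last row.
full_hours = pd.date_range(df.index.min().floor("h"), df.index.max().floor("h"), freq="h")

# Auxiliary index: full hours plus the original timestamps, sorted and de-duplicated.
ind = full_hours.union(df.index)
print(ind)
```

The sketch uses `"h"`, the current spelling of the hourly alias; the printouts in this answer show the older `H`.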
Then create an auxiliary DataFrame, reindexed with the index just created, and "back fill" the event column:
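The code for this step is also missing here. A minimal sketch of what the sentence describes, assuming the auxiliary index is built as a union of full hours and source timestamps, could look like this. The name `df2` is the one used later in the answer; everything else is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2020-09-09 21:01:00", "2020-09-10 00:10:00", "2020-09-10 01:31:00"]),
    "event": ["a", "a", "a"],
    "duration": [12, 22, 130],
}).set_index("ts")

# Assumed construction of the auxiliary index (full hours + source timestamps).
full_hours = pd.date_range(df.index.min().floor("h"), df.index.max().floor("h"), freq="h")
ind = full_hours.union(df.index)

# Reindexing adds one row per full hour; those rows have NaN in every column.
df2 = df.reindex(ind)

# Back-fill event so each full-hour row carries the event of the next source row.
df2["event"] = df2["event"].bfill()
print(df2)
```

The duration column is deliberately left with NaN at the full hours, since those gaps are what the later grouping step relies on.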
Define a function to be applied to each group of rows from df2:
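The definition of `parts` did not survive in this copy. The following is a reconstruction under assumptions, written to match the behaviour described in the text and the printed prodDet values; it should not be read as the answer's exact code. It assumes each group ends with a source row (non-null duration) preceded by full-hour rows, and it uses `pd.concat` with time-weighted interpolation.

```python
import pandas as pd

def parts(grp):
    lstRow = grp.iloc[-1]
    if pd.isna(lstRow.tsPrev):
        return pd.Series([int(lstRow.duration)], index=[grp.index[0]])  # first source row: keep whole
    # Cumulative amount: 0 at the previous timestamp, the full duration at the last row,
    # NaN at the full hours in between (filled by time-weighted interpolation).
    cum = pd.concat([pd.Series([0.0], index=[lstRow.tsPrev]), grp.duration])
    cum = cum.interpolate(method="time").round()
    # The step to the next point is the share of the interval starting at each index.
    return (-cum.diff(-1)).iloc[:-1].astype(int)

# One hand-built group: the full hours before the row 2020-09-10 00:10:00 (duration 22),
# whose previous source row was at 2020-09-09 21:01:00.
idx = pd.to_datetime(["2020-09-09 22:00", "2020-09-09 23:00", "2020-09-10 00:00", "2020-09-10 00:10"])
grp = pd.DataFrame({
    "event": ["a"] * 4,
    "duration": [float("nan")] * 3 + [22.0],
    "tsPrev": pd.to_datetime([None, None, None, "2020-09-09 21:01"]),
}, index=idx)
print(parts(grp))  # 7, 7, 7, 1 at 21:01, 22:00, 23:00, 00:00
```

Rounding the cumulative values before differencing, rather than rounding each share separately, keeps the shares of one source row summing to its duration.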
Then generate the source data for the "produced" column in two steps:
prodDet = df2.groupby(np.isfinite(df2.duration.values[::-1]).cumsum()[::-1],
    sort=False).apply(parts).reset_index(level=0, drop=True)
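Here df2 is grouped in such a way that each group ends with a row holding a non-null value in the duration column. Each group is then processed by the parts function.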
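To see how that grouping key behaves, here is a small sketch on a hand-made duration column (the eight values are illustrative, following the pattern of NaN at full hours and a number at source rows). It only demonstrates the key expression from the `groupby` call.

```python
import numpy as np
import pandas as pd

# NaN at full hours, a value at each source row.
duration = pd.Series(
    [np.nan, 12, np.nan, np.nan, np.nan, 22, np.nan, 130],
    index=pd.to_datetime([
        "2020-09-09 21:00", "2020-09-09 21:01", "2020-09-09 22:00", "2020-09-09 23:00",
        "2020-09-10 00:00", "2020-09-10 00:10", "2020-09-10 01:00", "2020-09-10 01:31",
    ]),
)

# Count non-null values from the end backwards, then restore the original order.
key = np.isfinite(duration.values[::-1]).cumsum()[::-1]
print(key)  # [3 3 2 2 2 2 1 1]

# Each group ends with the row that holds the non-null duration.
for k, g in duration.groupby(key, sort=False):
    print(k, g.index[-1], g.iloc[-1])
```

Because the count runs from the end, every NaN row receives the same key as the next non-null row after it, which is what makes each group end with a source row.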
The result is:
2020-09-09 21:00:00 12
2020-09-09 21:01:00 7
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 1
2020-09-10 00:10:00 80
2020-09-10 01:00:00 50
2020-09-10 01:31:00 60
2020-09-10 01:50:00 50
2020-09-10 01:51:00 26
2020-09-10 01:59:00 36
2020-09-10 02:00:00 36
2020-09-10 02:01:00 51
2020-09-10 02:51:00 57
2020-09-10 03:00:00 6
2020-09-10 03:01:00 78
2020-09-10 04:00:00 1
2020-09-10 04:01:00 85
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
dtype: int32
Summing these parts per hour gives the produced series:
2020-09-09 21:00:00 19
2020-09-09 22:00:00 7
2020-09-09 23:00:00 7
2020-09-10 00:00:00 81
2020-09-10 01:00:00 222
2020-09-10 02:00:00 144
2020-09-10 03:00:00 84
2020-09-10 04:00:00 86
2020-09-10 05:00:00 87
2020-09-10 06:00:00 7
Freq: H, Name: produced, dtype: int32
Combined with the event column and the tsMin column, the final result is:
ts produced event tsMin
0 2020-09-09 21:00:00 19 a 2020-09-09 21:01:00
1 2020-09-09 22:00:00 7 a NaT
2 2020-09-09 23:00:00 7 a NaT
3 2020-09-10 00:00:00 81 a 2020-09-10 00:10:00
4 2020-09-10 01:00:00 222 a 2020-09-10 01:31:00
5 2020-09-10 02:00:00 144 c 2020-09-10 02:01:00
6 2020-09-10 03:00:00 84 b 2020-09-10 03:01:00
7 2020-09-10 04:00:00 86 c 2020-09-10 04:01:00
8 2020-09-10 05:00:00 87 c NaT
9 2020-09-10 06:00:00 7 c 2020-09-10 06:05:00
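The row 21:00:00 12 comes from the first source row (you forgot about it in your expected result). The rows that follow, up to 00:00:00, come from "partitioning" the source row 00:10:00 a 22, just as in your expected result.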
For example, the value 81 for 00:00:00 is the sum of 1 and 80 (the first part resulting from the row with 130), see prodDet above.
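The aggregation code itself is not shown in this copy. Assuming it is a plain hourly `resample` followed by `sum`, the figure can be reproduced from the printed prodDet values; the series below copies the detail rows for the hours 00:00 and 01:00, and the name `prod` is illustrative.

```python
import pandas as pd

# Detailed parts for two hours, copied from the prodDet printout.
prodDet = pd.Series(
    [1, 80, 50, 60, 50, 26, 36],
    index=pd.to_datetime([
        "2020-09-10 00:00:00", "2020-09-10 00:10:00", "2020-09-10 01:00:00",
        "2020-09-10 01:31:00", "2020-09-10 01:50:00", "2020-09-10 01:51:00",
        "2020-09-10 01:59:00",
    ]),
)

# Sum all parts that start within the same clock hour.
prod = prodDet.resample("h").sum().rename("produced")
print(prod)  # 00:00 -> 81, 01:00 -> 222
```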
Some values in the tsMin column are empty; these are the hours in which there is no source row.
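A short sketch of why those gaps appear: taking the earliest source timestamp per hour leaves NaT wherever an hour has no rows. This variant uses the built-in `min` of the resampler instead of the answer's `apply` with a lambda, so it is a substitute technique, and the four timestamps are just a sample.

```python
import pandas as pd

ts = pd.to_datetime([
    "2020-09-09 21:01:00", "2020-09-10 00:10:00",
    "2020-09-10 01:31:00", "2020-09-10 01:50:00",
])
# A series whose values are its own timestamps, so min() returns a timestamp.
s = pd.Series(ts, index=ts)

# Earliest source timestamp per clock hour; hours without any source row give NaT.
tsMin = s.resample("h").min()
print(tsMin)
```

Here the hours 22:00 and 23:00 contain no source row, so their tsMin is NaT, while 01:00 reports 01:31:00, the earlier of its two rows.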
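If you want to drop the result from the first row entirely (the one with duration == 12), change return pd.Series([lstRow.duration]… to return pd.Series([0]… (the 4th line of the parts function).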
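All in all, my solution is more idiomatic pandas and considerably shorter than yours (17 lines for mine versus about 70 for yours, excluding comments).
Why is the value for 2020-09-10 00:00:00 not equal to 81? It should certainly match.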
3 2020-09-10 00:00:00:00 10.0 1.0 0 0 2020-09-10 00:00:00 50.0 80.0 1 2020-09-10 01:00 31.0 50.0 0 0 2020-09-10 01:00:00 19.0 60.0
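The reason is a discrepancy between the code sample that creates df and its printout (above). I pointed this out in a comment on your post. Your code contains a row with 'a': 130, 'b': 'a' and …, 'ts': '2020-09-10 00:31:00' (I used only this code). So this source row (the one with 130) concerns activity between 00:10:00 and 00:31:00, and it appears in my intermediate result as 2020-09-10 00:10:00 130. Start from the df created by the code sample and then check the details.
Thanks, @Valdi_Bo! I have changed the df definition. Could you re-verify your answer and update the column names in the post? I will accept it. I appreciate your help with this problem. Also, if you could add the minutes for every hour, it might help other users!
I corrected my solution and also added the tsMin column (the minutes for all hours).
Thanks. I was hoping for the duration per hour (in minutes) rather than what you gave. Could you add that?
The printout of the source DataFrame is inconsistent with the code that creates it. For example, in row 3 your code contains 00:31:00, whereas the printout shows 01:31:00. Another difference is that the code contains columns a and b, but the printout has duration and event. Note also that the c column is not needed in the code.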
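The tsMin column (the earliest ts within each hour) is added, and the index is turned into a regular ts column, as follows:
result['tsMin'] = df.duration.resample('H').apply(lambda grp: grp.index.min())
result = result.reset_index().rename(columns={'index': 'ts'})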