Python: a shortcut for pandas groupby & join without creating an intermediate DataFrame
While trying to trim my code, I noticed that I often need to `groupby` and then join the grouped DataFrame back to the parent DataFrame. Is there a more elegant, more concise way to write this? Please check my example below.

Assume this parent DataFrame (you can copy and paste it to play with). The example below shows a simple `groupby` and join that fills the `sum_24_a` column, which aggregates the hourly sum values for each calendar day. However, I recently found that I can shorten this by applying the second part of the code, which fills the `sum_24_b` column:
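The construction code for the parent frame is not included above, so here is a minimal reconstruction read off the printed table, as a runnable sketch (the 6-hourly index and `min_count=1`, which keeps the all-NaN row as NaN, are my assumptions about how the frame was built):

```python
import numpy as np
import pandas as pd

# 6-hourly index covering 2018-01-01 00:00 through 2018-01-04 00:00 (13 rows)
idx = pd.date_range('2018-01-01', '2018-01-04', freq='6h')
frame_total = pd.DataFrame(
    {'chem_1': [-5, 9, -1, 4, -2, 3, 4, np.nan, np.nan, np.nan, 8, np.nan, 9],
     'chem_2': [6, -1, -4, np.nan, -7, -5, 5, np.nan, 10, -9, 8, 6, np.nan]},
    index=idx)
# row-wise sum; min_count=1 leaves the row where both values are NaN as NaN
frame_total['sum_hourly'] = frame_total[['chem_1', 'chem_2']].sum(axis=1,
                                                                  min_count=1)
```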
# first part: create a dataframe and then join to get column 'sum_24_a'
frame_sum=frame_total.groupby(frame_total.index.date)['sum_hourly'].sum().to_frame('sum_24_a')
frame_total=frame_total.join(frame_sum)
# second part: directly create column 'sum_24_b' without the need of joining
frame_total['sum_24_b']=frame_total['sum_hourly'].groupby(frame_total.index.date).sum()
print(frame_total)
chem_1 chem_2 sum_hourly sum_24_a sum_24_b
2018-01-01 00:00:00 -5.0 6.0 1.0 8.0 8.0
2018-01-01 06:00:00 9.0 -1.0 8.0 NaN NaN
2018-01-01 12:00:00 -1.0 -4.0 -5.0 NaN NaN
2018-01-01 18:00:00 4.0 NaN 4.0 NaN NaN
2018-01-02 00:00:00 -2.0 -7.0 -9.0 -2.0 -2.0
2018-01-02 06:00:00 3.0 -5.0 -2.0 NaN NaN
2018-01-02 12:00:00 4.0 5.0 9.0 NaN NaN
2018-01-02 18:00:00 NaN NaN NaN NaN NaN
2018-01-03 00:00:00 NaN 10.0 10.0 23.0 23.0
2018-01-03 06:00:00 NaN -9.0 -9.0 NaN NaN
2018-01-03 12:00:00 8.0 8.0 16.0 NaN NaN
2018-01-03 18:00:00 NaN 6.0 6.0 NaN NaN
2018-01-04 00:00:00 9.0 NaN 9.0 9.0 9.0
Question: is there a similar suggestion for performing the following more complex groupby, agg, and join without creating the `frame_day` DataFrame and then joining it to the original DataFrame, as shown below?
frame_day=frame_total.between_time('10:00:00', '16:00:00').\
groupby(frame_total.between_time('10:00:00', '16:00:00').index.date)['sum_hourly'].\
agg([('sum_day', lambda x: x.sum()), \
('positive_sum_day', lambda x: x[x>0].sum()), \
('negative_sum_day', lambda x: x[x<0].sum())])
frame_total=frame_total.join(frame_day)
print(frame_total.head(8))
chem_1 chem_2 sum_hourly sum_24_a sum_24_b \
2018-01-01 00:00:00 -5.0 6.0 1.0 8.0 8.0
2018-01-01 06:00:00 9.0 -1.0 8.0 NaN NaN
2018-01-01 12:00:00 -1.0 -4.0 -5.0 NaN NaN
2018-01-01 18:00:00 4.0 NaN 4.0 NaN NaN
2018-01-02 00:00:00 -2.0 -7.0 -9.0 -2.0 -2.0
2018-01-02 06:00:00 3.0 -5.0 -2.0 NaN NaN
2018-01-02 12:00:00 4.0 5.0 9.0 NaN NaN
2018-01-02 18:00:00 NaN NaN NaN NaN NaN
sum_day positive_sum_day negative_sum_day
2018-01-01 00:00:00 -5.0 0.0 -5.0
2018-01-01 06:00:00 NaN NaN NaN
2018-01-01 12:00:00 NaN NaN NaN
2018-01-01 18:00:00 NaN NaN NaN
2018-01-02 00:00:00 9.0 9.0 0.0
2018-01-02 06:00:00 NaN NaN NaN
2018-01-02 12:00:00 NaN NaN NaN
2018-01-02 18:00:00 NaN NaN NaN
Regarding the first question, here is a solution. You can drop the `date` column afterwards if you don't need it:
frame_total['date'] = frame_total.index.date
frame_total['sum_24_a'] = frame_total.groupby('date')['sum_hourly'].sum()
print(frame_total)
chem_1 chem_2 sum_hourly date sum_24_a
2018-01-01 00:00:00 -5.0 6.0 1.0 2018-01-01 8.0
2018-01-01 06:00:00 9.0 -1.0 8.0 2018-01-01 NaN
2018-01-01 12:00:00 -1.0 -4.0 -5.0 2018-01-01 NaN
2018-01-01 18:00:00 4.0 NaN 4.0 2018-01-01 NaN
2018-01-02 00:00:00 -2.0 -7.0 -9.0 2018-01-02 -2.0
2018-01-02 06:00:00 3.0 -5.0 -2.0 2018-01-02 NaN
2018-01-02 12:00:00 4.0 5.0 9.0 2018-01-02 NaN
2018-01-02 18:00:00 NaN NaN NaN 2018-01-02 NaN
2018-01-03 00:00:00 NaN 10.0 10.0 2018-01-03 23.0
2018-01-03 06:00:00 NaN -9.0 -9.0 2018-01-03 NaN
2018-01-03 12:00:00 8.0 8.0 16.0 2018-01-03 NaN
2018-01-03 18:00:00 NaN 6.0 6.0 2018-01-03 NaN
2018-01-04 00:00:00 9.0 NaN 9.0 2018-01-04 9.0
Regarding the second question, here is a simple way to generate the `sum_day` column. The other columns can be built the same way:
frame_total['sum_day'] = frame_total.loc[
frame_total.between_time('10:00:00', '16:00:00').index] \
.groupby('date')['sum_hourly'].agg('sum')
print(frame_total.head(8))
chem_1 chem_2 sum_hourly date sum_24_a sum_day
2018-01-01 00:00:00 -5.0 6.0 1.0 2018-01-01 8.0 -5.0
2018-01-01 06:00:00 9.0 -1.0 8.0 2018-01-01 NaN NaN
2018-01-01 12:00:00 -1.0 -4.0 -5.0 2018-01-01 NaN NaN
2018-01-01 18:00:00 4.0 NaN 4.0 2018-01-01 NaN NaN
2018-01-02 00:00:00 -2.0 -7.0 -9.0 2018-01-02 -2.0 9.0
2018-01-02 06:00:00 3.0 -5.0 -2.0 2018-01-02 NaN NaN
2018-01-02 12:00:00 4.0 5.0 9.0 2018-01-02 NaN NaN
2018-01-02 18:00:00 NaN NaN NaN 2018-01-02 NaN NaN
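For what it's worth, all three daytime aggregates can also be attached in one pass without naming an intermediate frame, using named aggregation and plain column assignment instead of `join`. This is only a sketch against a rebuilt copy of the question's data; `between` and `aggs` are names chosen here, and converting the group keys to timestamps makes the results align on the midnight rows, matching the printed output:

```python
import numpy as np
import pandas as pd

# rebuild the question's frame (sum_hourly only), read off the printed table
idx = pd.date_range('2018-01-01', '2018-01-04', freq='6h')
frame_total = pd.DataFrame({'sum_hourly': [1, 8, -5, 4, -9, -2, 9, np.nan,
                                           10, -9, 16, 6, 9]}, index=idx)

# restrict to the daytime window, then compute all aggregates in one agg call
between = frame_total.between_time('10:00:00', '16:00:00')
aggs = between.groupby(between.index.date)['sum_hourly'].agg(
    sum_day='sum',
    positive_sum_day=lambda x: x[x > 0].sum(),
    negative_sum_day=lambda x: x[x < 0].sum())

# place the per-day results on the midnight rows without an explicit join:
# series assignment aligns on the DatetimeIndex, leaving NaN elsewhere
aggs.index = pd.to_datetime(aggs.index)
for col in aggs.columns:
    frame_total[col] = aggs[col]
```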
Comments: Use `groupby.transform` to preserve the original index. — Thanks @datanoveler, I found this link () where I can see a good example. If I manage to apply it to this dataset, I will post it; otherwise, feel free to post an answer.
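The `groupby.transform` idea from the comment can be sketched as follows, against a rebuilt copy of the question's data. Note the semantics differ slightly from the question's join: `transform` broadcasts the daily total to every row of that day instead of leaving NaN outside the midnight rows (`sum_24` is a name chosen here):

```python
import numpy as np
import pandas as pd

# rebuild the question's frame (sum_hourly only), read off the printed table
idx = pd.date_range('2018-01-01', '2018-01-04', freq='6h')
frame_total = pd.DataFrame({'sum_hourly': [1, 8, -5, 4, -9, -2, 9, np.nan,
                                           10, -9, 16, 6, 9]}, index=idx)

# transform keeps the original DatetimeIndex, so no join is needed and
# every row of a calendar day carries that day's total
frame_total['sum_24'] = (frame_total
                         .groupby(frame_total.index.date)['sum_hourly']
                         .transform('sum'))
```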