Python: exploring a shortcut for pandas groupby and join without creating an intermediate DataFrame

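While trying to trim down my code, I noticed that I often need to groupby and then join the grouped DataFrame back to the parent DataFrame. Is there a more elegant, more concise way to write this? Please see my example below:

Assume this parent DataFrame: (you can copy and paste it to play with)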
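
The code that defined the parent DataFrame did not survive in this copy of the post. The following is only a reconstruction, assuming the chem_1 and chem_2 values shown in the printed output further down, a 6-hour DatetimeIndex starting on 2018-01-01, and that sum_hourly is the row-wise sum of the two columns with NaN skipped. The construction itself is a guess, not the asker's original code.

```python
import numpy as np
import pandas as pd

# Assumption: 13 rows at 6-hour steps starting 2018-01-01, as in the printed output.
index = pd.date_range("2018-01-01", periods=13, freq=pd.Timedelta(hours=6))

# Values copied from the chem_1 / chem_2 columns of the printed output.
frame_total = pd.DataFrame(
    {
        "chem_1": [-5, 9, -1, 4, -2, 3, 4, np.nan, np.nan, np.nan, 8, np.nan, 9],
        "chem_2": [6, -1, -4, np.nan, -7, -5, 5, np.nan, 10, -9, 8, 6, np.nan],
    },
    index=index,
)

# Assumption: sum_hourly is the row-wise sum that skips NaN.
# min_count=1 keeps the result NaN when both inputs are NaN.
frame_total["sum_hourly"] = frame_total[["chem_1", "chem_2"]].sum(axis=1, min_count=1)
```

With this frame in place, the snippets in the question can be run as posted.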

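Example: the code below shows a simple groupby and join that fills the 'sum_24_a' column, which holds the sum of the 'sum_hourly' values for each calendar day. However, I recently found that I can shorten this by using the second part of the code, which fills the 'sum_24_b' column.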

# first part: create a dataframe and then join to get column 'sum_24_a'
frame_sum=frame_total.groupby(frame_total.index.date)['sum_hourly'].sum().to_frame('sum_24_a')
frame_total=frame_total.join(frame_sum)

# second part: directly create column 'sum_24_b' without the need of joining
frame_total['sum_24_b']=frame_total['sum_hourly'].groupby(frame_total.index.date).sum()

print(frame_total)

                     chem_1  chem_2  sum_hourly  sum_24_a  sum_24_b
2018-01-01 00:00:00    -5.0     6.0         1.0       8.0       8.0
2018-01-01 06:00:00     9.0    -1.0         8.0       NaN       NaN
2018-01-01 12:00:00    -1.0    -4.0        -5.0       NaN       NaN
2018-01-01 18:00:00     4.0     NaN         4.0       NaN       NaN
2018-01-02 00:00:00    -2.0    -7.0        -9.0      -2.0      -2.0
2018-01-02 06:00:00     3.0    -5.0        -2.0       NaN       NaN
2018-01-02 12:00:00     4.0     5.0         9.0       NaN       NaN
2018-01-02 18:00:00     NaN     NaN         NaN       NaN       NaN
2018-01-03 00:00:00     NaN    10.0        10.0      23.0      23.0
2018-01-03 06:00:00     NaN    -9.0        -9.0       NaN       NaN
2018-01-03 12:00:00     8.0     8.0        16.0       NaN       NaN
2018-01-03 18:00:00     NaN     6.0         6.0       NaN       NaN
2018-01-04 00:00:00     9.0     NaN         9.0       9.0       9.0
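Question: is there a similar way to do the following more complex groupby, agg and join without creating the 'frame_day' DataFrame and then joining it to the original DataFrame, as shown below?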

frame_day=frame_total.between_time('10:00:00', '16:00:00').\
          groupby(frame_total.between_time('10:00:00', '16:00:00').index.date)['sum_hourly'].\
          agg([('sum_day', lambda x: x.sum()), \
               ('positive_sum_day', lambda x: x[x>0].sum()), \
               ('negative_sum_day', lambda x: x[x<0].sum())])
frame_total=frame_total.join(frame_day)

print(frame_total.head(8))

                     chem_1  chem_2  sum_hourly  sum_24_a  sum_24_b  \
2018-01-01 00:00:00    -5.0     6.0         1.0       8.0       8.0   
2018-01-01 06:00:00     9.0    -1.0         8.0       NaN       NaN   
2018-01-01 12:00:00    -1.0    -4.0        -5.0       NaN       NaN   
2018-01-01 18:00:00     4.0     NaN         4.0       NaN       NaN   
2018-01-02 00:00:00    -2.0    -7.0        -9.0      -2.0      -2.0   
2018-01-02 06:00:00     3.0    -5.0        -2.0       NaN       NaN   
2018-01-02 12:00:00     4.0     5.0         9.0       NaN       NaN   
2018-01-02 18:00:00     NaN     NaN         NaN       NaN       NaN   

                     sum_day  positive_sum_day  negative_sum_day  
2018-01-01 00:00:00     -5.0               0.0              -5.0  
2018-01-01 06:00:00      NaN               NaN               NaN  
2018-01-01 12:00:00      NaN               NaN               NaN  
2018-01-01 18:00:00      NaN               NaN               NaN  
2018-01-02 00:00:00      9.0               9.0               0.0  
2018-01-02 06:00:00      NaN               NaN               NaN  
2018-01-02 12:00:00      NaN               NaN               NaN  
2018-01-02 18:00:00      NaN               NaN               NaN  
Regarding the first question, here is a solution. You can drop the date column later if you don't need it.

frame_total['date'] = frame_total.index.date
frame_total['sum_24_a'] = frame_total.groupby('date')['sum_hourly'].sum()
print(frame_total)

                     chem_1  chem_2  sum_hourly        date  sum_24_a
2018-01-01 00:00:00    -5.0     6.0         1.0  2018-01-01       8.0
2018-01-01 06:00:00     9.0    -1.0         8.0  2018-01-01       NaN
2018-01-01 12:00:00    -1.0    -4.0        -5.0  2018-01-01       NaN
2018-01-01 18:00:00     4.0     NaN         4.0  2018-01-01       NaN
2018-01-02 00:00:00    -2.0    -7.0        -9.0  2018-01-02      -2.0
2018-01-02 06:00:00     3.0    -5.0        -2.0  2018-01-02       NaN
2018-01-02 12:00:00     4.0     5.0         9.0  2018-01-02       NaN
2018-01-02 18:00:00     NaN     NaN         NaN  2018-01-02       NaN
2018-01-03 00:00:00     NaN    10.0        10.0  2018-01-03      23.0
2018-01-03 06:00:00     NaN    -9.0        -9.0  2018-01-03       NaN
2018-01-03 12:00:00     8.0     8.0        16.0  2018-01-03       NaN
2018-01-03 18:00:00     NaN     6.0         6.0  2018-01-03       NaN
2018-01-04 00:00:00     9.0     NaN         9.0  2018-01-04       9.0
Regarding the second question, here is a simple way to generate the 'sum_day' column. The other columns can probably be built the same way:

frame_total['sum_day'] = frame_total.loc[
    frame_total.between_time('10:00:00', '16:00:00').index] \
    .groupby('date')['sum_hourly'].agg('sum')
print(frame_total.head(8))

                     chem_1  chem_2  sum_hourly        date  sum_24_a  sum_day
2018-01-01 00:00:00    -5.0     6.0         1.0  2018-01-01       8.0     -5.0
2018-01-01 06:00:00     9.0    -1.0         8.0  2018-01-01       NaN      NaN
2018-01-01 12:00:00    -1.0    -4.0        -5.0  2018-01-01       NaN      NaN
2018-01-01 18:00:00     4.0     NaN         4.0  2018-01-01       NaN      NaN
2018-01-02 00:00:00    -2.0    -7.0        -9.0  2018-01-02      -2.0      9.0
2018-01-02 06:00:00     3.0    -5.0        -2.0  2018-01-02       NaN      NaN
2018-01-02 12:00:00     4.0     5.0         9.0  2018-01-02       NaN      NaN
2018-01-02 18:00:00     NaN     NaN         NaN  2018-01-02       NaN      NaN

Use groupby.transform to preserve the original index.
Thanks @datanoveler, I found this link () where I can see a good example. If I manage to apply it to this dataset, I'll post it. Otherwise, feel free to post an answer.
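
The groupby.transform suggestion in the comments was never worked out in the thread. Below is one possible sketch of it, not the commenter's own code. It assumes the frame looks like the printed output in the question (the data is rebuilt here so the snippet runs on its own), and the names in_window and windowed are made up for illustration. Note that transform broadcasts each daily total to every row of that day, whereas the join-based output in the question only carries a value on the midnight row.

```python
import numpy as np
import pandas as pd

# Assumption: data rebuilt from the printed output in the question.
index = pd.date_range("2018-01-01", periods=13, freq=pd.Timedelta(hours=6))
frame_total = pd.DataFrame(
    {
        "chem_1": [-5, 9, -1, 4, -2, 3, 4, np.nan, np.nan, np.nan, 8, np.nan, 9],
        "chem_2": [6, -1, -4, np.nan, -7, -5, 5, np.nan, 10, -9, 8, 6, np.nan],
    },
    index=index,
)
frame_total["sum_hourly"] = frame_total[["chem_1", "chem_2"]].sum(axis=1, min_count=1)

# Group key: the calendar day of each row, as in the question.
day = frame_total.index.date

# First question: the daily total, aligned to the original index without a join.
frame_total["sum_24"] = frame_total["sum_hourly"].groupby(day).transform("sum")

# Second question: blank out rows outside 10:00-16:00, then aggregate per day.
positions = frame_total.index.indexer_between_time("10:00:00", "16:00:00")
in_window = np.zeros(len(frame_total), dtype=bool)  # hypothetical helper name
in_window[positions] = True
windowed = frame_total["sum_hourly"].where(in_window)  # NaN outside the window

frame_total["sum_day"] = windowed.groupby(day).transform("sum")
# clip(lower=0) zeroes the negatives, so its sum equals x[x > 0].sum().
frame_total["positive_sum_day"] = windowed.clip(lower=0).groupby(day).transform("sum")
# clip(upper=0) zeroes the positives, so its sum equals x[x < 0].sum().
frame_total["negative_sum_day"] = windowed.clip(upper=0).groupby(day).transform("sum")

print(frame_total.head(8))
```

One difference from the join approach: a day with no rows inside the time window (2018-01-04 here) gets 0.0 rather than NaN, because the sum of an all-NaN group defaults to zero.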