python pandas merge_asof groupby
I have a merged DataFrame that looks like this:
>>> merged_df.dtypes
Jurisdiction object
AdjustedVolume float64
EffectiveStartDate datetime64[ns]
VintageYear int64
ProductType object
Rate float32
Obligation float32
Demand float64
Cost float64
dtype: object
The following groupby statement returns the correct AdjustedVolume values by Jurisdiction and VintageYear:
>>> merged_df.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume'].sum()
When ProductType is included:
>>> merged_df.groupby(['Jurisdiction', 'VintageYear','ProductType'])['AdjustedVolume'].sum()
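When a jurisdiction contains only one product type, the adjusted volume by year is correct. For any jurisdiction with two or more product types, however, the adjusted volume is split across them so that the parts sum to the correct value. I want every row to carry the total adjusted volume, and it is not clear to me why it is being split.
For example: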
>>> merged_df.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume'].sum()
Jurisdiction VintageYear AdjustedVolume
CA 2017 3.529964e+05
>>> merged_df.groupby(['Jurisdiction', 'VintageYear','ProductType'])['AdjustedVolume'].sum()
Jurisdiction VintageYear ProductType AdjustedVolume
CA 2017 Bucket1 7.584832e+04
CA 2017 Bucket2 1.308454e+05
CA 2017 Bucket3 1.463026e+05
I suspect the merge operation is incorrect:
>>> df1.dtypes
Jurisdiction object
ProductType object
VintageYear int64
EffectiveStartDate datetime64[ns]
Rate float32
Obligation float32
dtype: object
>>> df2.dtypes
Jurisdiction object
AdjustedVolume float64
EffectiveStartDate datetime64[ns]
VintageYear int64
dtype: object
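Because df2 has no ProductType field, the merge below splits the total volume across whichever ProductTypes fall under each jurisdiction. Can I modify the following merge so that every ProductType carries the total adjusted volume?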
merged_df = pd.merge_asof(df2, df1, on='EffectiveStartDate', by=['Jurisdiction','VintageYear'])
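The split can be reproduced on a small scale. The sketch below uses made-up data (the dates, volumes, and two-row tables are assumptions, not the asker's data) to show that `pd.merge_asof` returns one row per left-hand row, so each df2 row picks up exactly one ProductType and the volumes end up partitioned rather than repeated.

```python
import pandas as pd

# Hypothetical stand-in for df1: one row per ProductType.
df1 = pd.DataFrame({
    "Jurisdiction": ["CA", "CA"],
    "ProductType": ["Bucket1", "Bucket2"],
    "VintageYear": [2017, 2017],
    "EffectiveStartDate": pd.to_datetime(["2017-01-01", "2017-02-01"]),
})

# Hypothetical stand-in for df2: volumes only, no ProductType column.
df2 = pd.DataFrame({
    "Jurisdiction": ["CA", "CA"],
    "AdjustedVolume": [100.0, 200.0],
    "EffectiveStartDate": pd.to_datetime(["2017-01-15", "2017-02-15"]),
    "VintageYear": [2017, 2017],
})

# Each df2 row is matched to the latest df1 row at or before its date.
merged_df = pd.merge_asof(
    df2, df1, on="EffectiveStartDate", by=["Jurisdiction", "VintageYear"]
)

# The merge keeps the row count of df2, so every volume belongs to one ProductType only.
by_product = merged_df.groupby(
    ["Jurisdiction", "VintageYear", "ProductType"]
)["AdjustedVolume"].sum()
print(merged_df)
print(by_product)
```

In this toy case the first volume lands on Bucket1 and the second on Bucket2, which mirrors the per-bucket split seen in the question.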
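You can use two versions of the groupby and merge the two resulting tables. The first table is a groupby that includes ProductType, which breaks the adjusted volume down by ProductType: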
df = df.groupby(['Jurisdiction','VintageYear','ProductType']).agg({'AdjustedVolume':'sum'}).reset_index(drop = False)
Then create another table that does not include ProductType (this is where the total comes from).
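No code accompanies this step here. A minimal sketch of what it might look like, assuming the totals table is the one the later steps call `df1` (which reuses the name from the question) and using made-up data in place of the merged DataFrame:

```python
import pandas as pd

# Made-up stand-in for the merged DataFrame from the question.
merged_df = pd.DataFrame({
    "Jurisdiction": ["CA", "CA", "CA", "NY"],
    "VintageYear": [2017, 2017, 2017, 2017],
    "ProductType": ["Bucket1", "Bucket2", "Bucket2", "Bucket1"],
    "AdjustedVolume": [100.0, 200.0, 50.0, 10.0],
})

# Assumed second table: totals per Jurisdiction/VintageYear, no ProductType.
df1 = (
    merged_df.groupby(["Jurisdiction", "VintageYear"])
    .agg({"AdjustedVolume": "sum"})
    .reset_index()
)
print(df1)
```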
Now create an ID column in both tables so that the merge works properly:
df['ID'] = df['Jurisdiction'].astype(str)+'_' +df['VintageYear'].astype(str)
df1['ID'] = df1['Jurisdiction'].astype(str)+'_'+ df1['VintageYear'].astype(str)
Now merge on the IDs to get the total adjusted volume:
df = pd.merge(df, df1, left_on = ['ID'], right_on = ['ID'], how = 'inner')
The last step is to clean up the columns:
df = df.rename(columns = {'AdjustedVolume_x':'AdjustedVolume',
'AdjustedVolume_y':'TotalAdjustedVolume',
'Jurisdiction_x':'Jurisdiction',
'VintageYear_x':'VintageYear'})
del df['Jurisdiction_y']
del df['VintageYear_y']
Your output will look like this:
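The original output is not reproduced here. As a rough illustration only, the sketch below runs a condensed variant of the same idea on made-up numbers. It merges on the two key columns directly instead of building an ID column, which is a substitution on my part rather than the answer's exact steps; the names `by_product`, `totals`, and `result` are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the merged DataFrame from the question.
merged_df = pd.DataFrame({
    "Jurisdiction": ["CA", "CA", "CA", "NY"],
    "VintageYear": [2017, 2017, 2017, 2017],
    "ProductType": ["Bucket1", "Bucket2", "Bucket2", "Bucket1"],
    "AdjustedVolume": [100.0, 200.0, 50.0, 10.0],
})

keys = ["Jurisdiction", "VintageYear"]

# Table 1: volume broken down by ProductType.
by_product = merged_df.groupby(keys + ["ProductType"], as_index=False)["AdjustedVolume"].sum()

# Table 2: total volume per Jurisdiction/VintageYear, renamed before merging
# so no _x/_y suffixes need cleaning up afterwards.
totals = (
    merged_df.groupby(keys, as_index=False)["AdjustedVolume"].sum()
    .rename(columns={"AdjustedVolume": "TotalAdjustedVolume"})
)

# Every ProductType row now carries the total for its Jurisdiction/VintageYear.
result = by_product.merge(totals, on=keys, how="inner")
print(result)
```

Merging on a list of key columns avoids the string concatenation and the later rename and delete steps, at the cost of departing slightly from the ID-based recipe above.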
You should also consider retrieving the group aggregate inline with the other records, similar to a subquery aggregate in SQL:
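grpdf = merged_df.groupby(['Jurisdiction', 'VintageYear','ProductType'])['AdjustedVolume']\
    .sum().reset_index()
# transform is applied to grpdf (not merged_df) so the result aligns with grpdf's rows;
# grouping by Jurisdiction and VintageYear gives the total across all product types
grpdf['TotalAdjVolume'] = grpdf.groupby(['Jurisdiction', 'VintageYear'])['AdjustedVolume']\
    .transform('sum')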
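A minimal self-contained sketch of this transform approach, using made-up data. It assumes the total is wanted per Jurisdiction and VintageYear, and it applies `transform` to the aggregated frame so that the new column has the same index as that frame's rows.

```python
import pandas as pd

# Made-up stand-in for the merged DataFrame from the question.
merged_df = pd.DataFrame({
    "Jurisdiction": ["CA", "CA", "CA", "NY"],
    "VintageYear": [2017, 2017, 2017, 2017],
    "ProductType": ["Bucket1", "Bucket2", "Bucket2", "Bucket1"],
    "AdjustedVolume": [100.0, 200.0, 50.0, 10.0],
})

# Per-ProductType sums, one row per (Jurisdiction, VintageYear, ProductType).
grpdf = (
    merged_df.groupby(["Jurisdiction", "VintageYear", "ProductType"])["AdjustedVolume"]
    .sum()
    .reset_index()
)

# transform('sum') returns one value per input row, so each ProductType row
# receives the total of its Jurisdiction/VintageYear group.
grpdf["TotalAdjVolume"] = (
    grpdf.groupby(["Jurisdiction", "VintageYear"])["AdjustedVolume"].transform("sum")
)
print(grpdf)
```

Unlike the two-table merge, this keeps everything in one frame, much as a windowed or subquery aggregate would in SQL.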
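Two minor style tweaks: the drop=False argument is the default for reset_index() and is therefore redundant, and the columns can be removed with a single method call instead of separate del statements.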
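The name of the column-removal method did not survive in this comment. Assuming `DataFrame.drop` was meant, the cleanup step might be sketched as follows; the one-row frame is made up purely to have the suffixed columns available.

```python
import pandas as pd

# Made-up frame with the suffixed columns a merge on ID would leave behind.
df = pd.DataFrame({
    "Jurisdiction_x": ["CA"],
    "VintageYear_x": [2017],
    "AdjustedVolume": [100.0],
    "Jurisdiction_y": ["CA"],
    "VintageYear_y": [2017],
})

# reset_index() already defaults to drop=False, so it needs no argument.
df = df.reset_index()

# Assumption: DataFrame.drop replaces the two separate del statements
# (the index column added above is dropped here as well).
df = df.drop(columns=["index", "Jurisdiction_y", "VintageYear_y"])
print(df.columns.tolist())
```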