Python: How do I combine and do grouped calculations on pandas datasets?
I am writing an economics paper and need some help merging and transforming two datasets. I have two pandas DataFrames: one containing a list of countries and their neighbors (borderdf), and one containing data for each country and year (datadf), such as:
datadf
country gdp year
sweden 5454 2004
sweden 5676 2005
norway 3433 2004
norway 3433 2005
denmark 2132 2004
denmark 2342 2005
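I need to create a column in datadf, neighborsmeangdp, that contains the mean GDP of all of a country's neighbors, as listed in borderdf. I would like my result to look like this: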
datadf
country year gdp neighborsmeangdp
sweden 2004 5454 5565
sweden 2005 5676 5775
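How should I go about this?
I think a straightforward approach is to put the GDP values into borderdf, then take the sum of the groupby object, and finally merge: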
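# Assumes datadf2 is datadf indexed by (country, year); .ix was removed from pandas, so .loc is used
datadf2 = datadf.set_index(['country', 'year'])
borderdf[2004] = [datadf2.loc[(item, 2004), 'gdp'] for item in borderdf.neighbor]
borderdf[2005] = [datadf2.loc[(item, 2005), 'gdp'] for item in borderdf.neighbor]
# Sum only the year columns so the string column 'neighbor' is left out
gpdf = borderdf.groupby(by=['country'])[[2004, 2005]].sum()
# unstack() gives a Series indexed by (year column label, country)
df = gpdf.unstack().rename('neighborsmeangdp').reset_index()
df = df.rename(columns={'level_0': 'year'})
df['year'] = df['year'].astype(int)
# pd.ordered_merge has been renamed to pd.merge_ordered
print(pd.merge_ordered(datadf, df))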
country gdp year neighborsmeangdp
0 denmark 2132 2004 7586
1 germany 2132 2004 NaN
2 norway 3433 2004 NaN
3 sweden 5454 2004 5565
4 denmark 2342 2005 8018
5 germany 2342 2005 NaN
6 norway 3433 2005 NaN
7 sweden 5676 2005 5775
[8 rows x 4 columns]
Of course, I had to make up some data for Germany:
germany 2132 2004
germany 2342 2005
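For readers who want to run the approach above end to end, here is a minimal self-contained sketch. It assumes the border list shown in the other answer's example and the made-up Germany rows; the names `gdp_lookup`, `wide`, `sums` and `long_df` are illustrative only and do not come from the original post. It uses `melt` instead of `unstack` to get back to one row per country and year.

```python
import pandas as pd

# Sample data from the question, plus the made-up Germany rows.
datadf = pd.DataFrame({
    'country': ['sweden', 'sweden', 'norway', 'norway',
                'denmark', 'denmark', 'germany', 'germany'],
    'gdp': [5454, 5676, 3433, 3433, 2132, 2342, 2132, 2342],
    'year': [2004, 2005, 2004, 2005, 2004, 2005, 2004, 2005],
})
# Assumed border list (same pairs as in the merge-based example).
borderdf = pd.DataFrame({
    'country': ['sweden', 'sweden', 'denmark', 'denmark'],
    'neighbor': ['norway', 'denmark', 'germany', 'sweden'],
})

# GDP lookup keyed by (country, year).
gdp_lookup = datadf.set_index(['country', 'year'])['gdp']

# One column per year holding each neighbor's GDP.
wide = borderdf.copy()
for year in (2004, 2005):
    wide[year] = [gdp_lookup.loc[(n, year)] for n in wide['neighbor']]

# Sum the neighbors' GDP per country (year columns only).
sums = wide.groupby('country')[[2004, 2005]].sum()

# Back to long format: one row per (country, year).
long_df = sums.reset_index().melt(
    id_vars='country', var_name='year', value_name='neighborsmeangdp')
long_df['year'] = long_df['year'].astype(int)

# Left merge keeps countries that have no entry in borderdf (NaN for them).
result = datadf.merge(long_df, on=['country', 'year'], how='left')
print(result)
```

Note that, like the snippet above, this sums the neighbors' GDP; the column name is kept only to match the question.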
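I believe there is actually a nicer way to do this. You can use the pandas merge function to merge the two DataFrames directly.
The trick here is that you actually want to merge the country column in datadf with the neighbor column in borderdf.
Then use groupby and mean to get the average neighbor GDP.
Finally, merge back with the data to get the country's own GDP.
For example: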
import pandas as pd
from io import StringIO  # Python 3 location of StringIO

border_csv = '''
country, neighbor
sweden, norway
sweden, denmark
denmark, germany
denmark, sweden
'''
data_csv = '''
country, gdp, year
sweden, 5454, 2004
sweden, 5676, 2005
norway, 3433, 2004
norway, 3433, 2005
denmark, 2132, 2004
denmark, 2342, 2005
'''
# Blank lines are skipped by default, so the header is the first non-blank line;
# skipinitialspace=True drops the space after each comma
borders = pd.read_csv(StringIO(border_csv), skipinitialspace=True)
data = pd.read_csv(StringIO(data_csv), skipinitialspace=True)
# Attach each neighbor's GDP rows to the bordering country
merged = pd.merge(borders, data, left_on='neighbor', right_on='country')
merged = merged.drop('country_y', axis=1)
merged.columns = ['country', 'neighbor', 'gdp', 'year']
grouped = merged.groupby(['country', 'year'])
# Average only the gdp column; 'neighbor' holds strings and cannot be averaged
neighbor_means = grouped['gdp'].mean().rename('neighbor_gdp').reset_index()
# Bring back each country's own GDP
results_df = pd.merge(neighbor_means, data, on=['country', 'year'])
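One detail worth checking: the expected output in the question (5565 for Sweden in 2004) equals the sum of the neighbors' GDP (3433 + 2132), while the mean would be 2782.5. The sketch below is a variation on the merge approach that computes both, plus a count of neighbors that actually have data. The output column names (`neighbor_gdp_sum`, `neighbor_gdp_mean`, `n_neighbors`) are made up for illustration, and the left merge at the end is an assumption about wanting to keep countries that have no listed neighbors.

```python
import pandas as pd

borders = pd.DataFrame({
    'country': ['sweden', 'sweden', 'denmark', 'denmark'],
    'neighbor': ['norway', 'denmark', 'germany', 'sweden'],
})
data = pd.DataFrame({
    'country': ['sweden', 'sweden', 'norway', 'norway', 'denmark', 'denmark'],
    'gdp': [5454, 5676, 3433, 3433, 2132, 2342],
    'year': [2004, 2005, 2004, 2005, 2004, 2005],
})

# Rename so the merge key is 'neighbor' and no _x/_y suffixes are produced.
neighbor_data = data.rename(columns={'country': 'neighbor', 'gdp': 'neighbor_gdp'})

# Inner join: germany has no rows in data, so that border pair drops out.
merged = borders.merge(neighbor_data, on='neighbor')

# Named aggregation: several statistics of the same column in one pass.
stats = (merged.groupby(['country', 'year'])['neighbor_gdp']
               .agg(neighbor_gdp_sum='sum',
                    neighbor_gdp_mean='mean',
                    n_neighbors='count')
               .reset_index())

# Left merge keeps every (country, year) row from data, with NaN where
# a country has no neighbors listed in borders.
results = data.merge(stats, on=['country', 'year'], how='left')
print(results)
```

With this sample data Denmark's statistics are based on Sweden alone, because Germany has no GDP rows; `n_neighbors` makes that kind of gap visible.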
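Why was this question considered off-topic, would anyone care to explain? I also don't understand why it would be too broad. I think the wording of the question in the title is too broad, but the question itself is very specific. The OP gave sample input and sample output.
Because questions that describe your requirements and ask others to write the code for you, or to explain how to write it, are considered off-topic for Stack Overflow, but none of the standard close reasons fit. Some people seem to think that "too broad", "unclear what you're asking" or "lacks sufficient information to diagnose the problem" is always enough to cover this kind of question, but this case illustrates why those reasons often fail to convey the right message.
Ah, okay. So it is more or less "closed because the asker did not put in enough effort"?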