Python: join dataframes by key - repeated data as new columns
I am facing the following situation. I have two dataframes, say df1 and df2, and I need to join them by a key (ID_ed, ID). The second dataframe may contain several occurrences of a key, and what I need is to join the two dataframes and add the repeated occurrences as new columns (as shown in the image).
I tried using
merge = df2.join(df1, lsuffix='_zid', rsuffix='_iid', how="left")
and concat operations, but no luck so far. It seems to keep only the last occurrence (as if it were overwriting the data).
Any help is greatly appreciated, thanks in advance.
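A note on the join attempt: DataFrame.join aligns rows on the index unless `on` is given, so the call in the question pairs rows by position rather than by ID / ID_ed. That may be part of why the result looks wrong. The sketch below illustrates this with made-up sample data (the question's frames were only shown as an image, so the values here are an assumption); the suffix arguments are left out because the two frames share no column names.

```python
import pandas as pd

# Hypothetical sample data, not the asker's actual frames
df1 = pd.DataFrame({'ID_ed': [1, 2, 3], 'color': [5, 8, 7]})
df2 = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                    'code': [1.0, 5.0, None, 20.0, 74.0]})

# join() matches on the row index, not on the ID / ID_ed values
by_index = df2.join(df1, how='left')
print(by_index)

# merge() with explicit key columns matches on the key values instead
by_key = df2.merge(df1, how='left', left_on='ID', right_on='ID_ed')
print(by_key)
```

In by_index the second row pairs ID 1 with ID_ed 2, and the rows past the end of df1 get NaN; in by_key every row carries the color that belongs to its ID.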
I would use cumcount and pivot_table:
In [11]: df1
Out[11]:
   ID  color
0   1      5
1   2      8
2   3      7
In [12]: df2
Out[12]:
   ID  code
0   1   1.0
1   1   5.0
2   2   NaN
3   2  20.0
4   2  74.0
In [13]: res = df1.merge(df2) # This is a merge if the column names match
In [14]: res
Out[14]:
   ID  color  code
0   1      5   1.0
1   1      5   5.0
2   2      8   NaN
3   2      8  20.0
4   2      8  74.0
In [15]: res['count'] = res.groupby('ID').cumcount()
In [16]: res.pivot_table('code', ['ID', 'color'], 'count')
Out[16]:
count       0     1     2
ID color
1  5      1.0   5.0   NaN
2  8      NaN  20.0  74.0
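The session above can be condensed into one script. This is only a minimal sketch, assuming the question's column names (ID_ed in df1, ID in df2), which is why the merge names the keys explicitly. The code_ prefix and aggfunc='first' are my own choices: the default aggregation is the mean, which gives the same numbers here because each cell holds a single value.

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'ID_ed': [1, 2, 3], 'color': [5, 8, 7]})
df2 = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                    'code': [1.0, 5.0, np.nan, 20.0, 74.0]})

# The key columns have different names, so name them explicitly
res = df1.merge(df2, left_on='ID_ed', right_on='ID')

# Number the occurrences of each key: 0, 1, 2, ...
res['count'] = res.groupby('ID_ed').cumcount()

# One row per (ID_ed, color), one column per occurrence number
wide = res.pivot_table(values='code', index=['ID_ed', 'color'],
                       columns='count', aggfunc='first')

# Turn the occurrence numbers into column names (prefix is an assumption)
wide = wide.add_prefix('code_').reset_index()
print(wide)
```

Because the merge is an inner join, keys of df1 with no match in df2 (ID_ed 3 in this sample) do not appear in the result.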
Another approach is to use set_index and unstack before calling pivot_table. The pivot_table aggregation would be 'first'. This approach would be fairly similar to ...
Generate the data
import pandas as pd
import numpy as np
a = [['ID_ed','color'],[1,5],[2,8],[3,7]]
b = [['ID','code'],[1,1],[1,5],
[2,np.nan],[2,20],[2,74],
[3,10],[3,98],[3,85],
[3,21],[3,45]]
df1 = pd.DataFrame(a[1:], columns=a[0])
df2 = pd.DataFrame(b[1:], columns=b[0])
print(df1)
   ID_ed  color
0      1      5
1      2      8
2      3      7
print(df2)
   ID  code
0   1   1.0
1   1   5.0
2   2   NaN
3   2  20.0
4   2  74.0
5   3  10.0
6   3  98.0
7   3  85.0
8   3  21.0
9   3  45.0
First, merge and unstack
# Merge and add a serial counter column
df = df1.merge(df2, how='inner', left_on='ID_ed', right_on='ID')
df['counter'] = df.groupby('ID_ed').cumcount()+1
print(df)
   ID_ed  color  ID  code  counter
0      1      5   1   1.0        1
1      1      5   1   5.0        2
2      2      8   2   NaN        1
3      2      8   2  20.0        2
4      2      8   2  74.0        3
5      3      7   3  10.0        1
6      3      7   3  98.0        2
7      3      7   3  85.0        3
8      3      7   3  21.0        4
9      3      7   3  45.0        5
# Set index and unstack
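# Keep the result under its own name so that df (with the counter column)
# stays available for the pivot table step
dfu = df.set_index(['ID_ed','color','counter']).\
    unstack().\
    swaplevel(1,0,axis=1).\
    sort_index(level=0,axis=1).add_prefix('counter_')
print(dfu)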
counter       counter_1               counter_2               \
             counter_ID counter_code counter_ID counter_code
ID_ed color
1     5             1.0          1.0        1.0          5.0
2     8             2.0          NaN        2.0         20.0
3     7             3.0         10.0        3.0         98.0

counter       counter_3               counter_4               counter_5
             counter_ID counter_code counter_ID counter_code counter_ID counter_code
ID_ed color
1     5             NaN          NaN        NaN          NaN        NaN          NaN
2     8             2.0         74.0        NaN          NaN        NaN          NaN
3     7             3.0         85.0        3.0         21.0        3.0         45.0
Next, generate the pivot table
# Pivot table with 'first' aggregation
dfp = pd.pivot_table(df, index=['ID_ed','color'],
columns=['counter'],
values=['ID', 'code'],
aggfunc='first')
print(dfp)
             ID                       code
counter       1    2    3    4    5      1     2     3     4     5
ID_ed color
1     5     1.0  1.0  NaN  NaN  NaN    1.0   5.0   NaN   NaN   NaN
2     8     2.0  2.0  2.0  NaN  NaN    NaN  20.0  74.0   NaN   NaN
3     7     3.0  3.0  3.0  3.0  3.0   10.0  98.0  85.0  21.0  45.0
Finally, rename the columns and slice by partial column name
# Rename columns
level_1_names = list(dfp.columns.get_level_values(1))
level_0_names = list(dfp.columns.get_level_values(0))
new_cnames = [b+'_'+str(f) for f, b in zip(level_1_names, level_0_names)]
dfp.columns = new_cnames
# Slice by new column names
print(dfp.loc[:, dfp.columns.str.contains('code')].reset_index(drop=False))
   ID_ed  color  code_1  code_2  code_3  code_4  code_5
0      1      5     1.0     5.0     NaN     NaN     NaN
1      2      8     NaN    20.0    74.0     NaN     NaN
2      3      7    10.0    98.0    85.0    21.0    45.0
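If only the code columns are wanted, the steps can be folded into one small function that unstacks just that column, which avoids producing the repeated ID columns in the first place. This is a sketch rather than a definitive implementation: the helper name widen and its parameters are hypothetical, and it assumes each key combination is unique once the counter is added.

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'ID_ed': [1, 2, 3], 'color': [5, 8, 7]})
df2 = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'code': [1, 5, np.nan, 20, 74, 10, 98, 85, 21, 45]})

def widen(left, right, left_key, right_key, value):
    """Hypothetical helper: spread repeated `value` entries into numbered columns."""
    merged = left.merge(right, how='inner', left_on=left_key, right_on=right_key)
    # Serial number of each occurrence within a key, starting at 1
    merged['counter'] = merged.groupby(left_key).cumcount() + 1
    # The right-hand key duplicates the left-hand one after the merge
    if right_key != left_key:
        merged = merged.drop(columns=right_key)
    index_cols = [c for c in merged.columns if c not in (value, 'counter')]
    # Move the counter level from the row index into the columns
    wide = merged.set_index(index_cols + ['counter'])[value].unstack()
    wide.columns = [f'{value}_{n}' for n in wide.columns]
    return wide.reset_index()

out = widen(df1, df2, 'ID_ed', 'ID', 'code')
print(out)
```

Selecting the single column before unstack keeps the column index flat, so no renaming of a two-level header is needed afterwards.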
Please use the output of print(df.to_string()) rather than a screenshot of a spreadsheet.