Python: joining two dataframes on values within the second dataframe

python python-3.x pandas join dataframe

I am trying to join two dataframes on values within the dataset:

df1     t0      t1      text0   text1
ID                                  
2133    7.0     3.0     NaN     NaN
1234    10.0    8.0     NaN     NaN
7352    9.0     7.0     NaN     NaN
2500    7.0     6.0     NaN     NaN
3298    10.0    8.0     NaN     NaN

df1 (see above)

and df2 (see above)

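I am trying to combine the two dataframes so that the NaNs in df1 are replaced with the text from df2. As you can see, the text is found by matching the ID together with the score in t0 or t1. Ideally the result would look like this: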

 df1     t0     t1      text0   text1
ID                                  
2133    7.0     3.0     asdf    qwer
1234    10.0    8.0     pois    zzzz
7352    9.0     7.0     ijsd    bdcs
2500    7.0     6.0     cccc    erer
3298    10.0    8.0     swed    ytyt

I tried to use pd.merge to do the join, but have not made any progress. Thanks for your help.

You can first reshape with melt, dropping the empty columns text0 and text1:

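import pandas as pd

# Assumes ID is a regular column of df1; if it is the index, call df1.reset_index() first.
df = pd.melt(df1.drop(['text0','text1'], axis=1), id_vars='ID', value_name='score')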
print (df)
     ID variable  score
0  2133       t0    7.0
1  1234       t0   10.0
2  7352       t0    9.0
3  2500       t0    7.0
4  3298       t0   10.0
5  2133       t1    3.0
6  1234       t1    8.0
7  7352       t1    7.0
8  2500       t1    6.0
9  3298       t1    8.0

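Then merge with an inner join. The parameter how='inner' is the default, so it is omitted; on=['ID','score'] is omitted as well, because these are the only two columns the two DataFrames have in common: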
df = pd.merge(df2, df)
print (df)
     ID  score text_org variable
0  2133    7.0     asdf       t0
1  2500    7.0     cccc       t0
2  3298    8.0     ytyt       t1
3  2133    3.0     qwer       t1
4  1234   10.0     pois       t0
5  7352    9.0     ijsd       t0
6  7352    7.0     bdcs       t1
7  3298   10.0     swed       t0
8  1234    8.0     zzzz       t1
9  2500    6.0     erer       t1
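
The point about the defaults can be checked on a small subset of the data. This is only a sketch: the names df1_small, df2_small and long_df are illustrative, and only two of the IDs from the tables above are used. Both merge calls are expected to give the same frame.

```python
import pandas as pd

# Two IDs taken from the tables above; the empty text columns are left out.
df1_small = pd.DataFrame({"ID": [2133, 1234], "t0": [7.0, 10.0], "t1": [3.0, 8.0]})
df2_small = pd.DataFrame({
    "ID": [2133, 2133, 1234, 1234],
    "score": [7.0, 3.0, 10.0, 8.0],
    "text_org": ["asdf", "qwer", "pois", "zzzz"],
})

# Long format: one row per (ID, variable) with the score as the value.
long_df = pd.melt(df1_small, id_vars="ID", value_name="score")

# Relying on the defaults: inner join on all shared columns.
implicit = pd.merge(df2_small, long_df)
# The same join with the defaults spelled out.
explicit = pd.merge(df2_small, long_df, how="inner", on=["ID", "score"])

print(implicit.equals(explicit))
```

Spelling out how and on is a reasonable habit when the frames may later gain other shared columns, since the implicit key set would then change silently.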

Finally, reshape back to wide format and set the column names from df1 without its first column (df1.columns[1:]).
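
A minimal end-to-end sketch of this last step, with the sample data rebuilt from the tables above. The reshape is done here with set_index plus unstack, which is an assumption made for illustration; any equivalent wide reshape would serve. The names long_df, merged and wide are also illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [2133, 1234, 7352, 2500, 3298],
    "t0": [7.0, 10.0, 9.0, 7.0, 10.0],
    "t1": [3.0, 8.0, 7.0, 6.0, 8.0],
    "text0": [float("nan")] * 5,
    "text1": [float("nan")] * 5,
})
df2 = pd.DataFrame({
    "ID": [2133, 2500, 3298, 2133, 1234, 7352, 7352, 3298, 1234, 2500],
    "score": [7.0, 7.0, 8.0, 3.0, 10.0, 9.0, 7.0, 10.0, 8.0, 6.0],
    "text_org": ["asdf", "cccc", "ytyt", "qwer", "pois",
                 "ijsd", "bdcs", "swed", "zzzz", "erer"],
})

# Steps 1 and 2: melt df1 to long format, then inner-join with df2.
long_df = pd.melt(df1.drop(["text0", "text1"], axis=1),
                  id_vars="ID", value_name="score")
merged = pd.merge(df2, long_df)

# Step 3: back to wide format, one row per ID.
# The columns come out as (score, t0), (score, t1), (text_org, t0), (text_org, t1).
wide = merged.set_index(["ID", "variable"]).unstack()

# Flatten the column names using df1's names without the leading ID column.
wide.columns = df1.columns[1:]
print(wide)
```

This only works while every (ID, variable) pair occurs once in the merged frame; otherwise the reshape raises a ValueError.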
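Edit in response to a comment:

You get:

ValueError: Index contains duplicate entries, cannot reshape

The problem is that df2 has duplicates in the ID and score columns. For example, add a new row at the end with the same ID and score as the first row (2133 and 7.0), and you get duplicates: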
print (df2)
      ID  score text_org
0   2133    7.0     asdf
1   2500    7.0     cccc
2   3298    8.0     ytyt
3   2133    3.0     qwer
4   1234   10.0     pois
5   7352    9.0     ijsd
6   7352    7.0     bdcs
7   3298   10.0     swed
8   1234    8.0     zzzz
9   2500    6.0     erer
10  2133    7.0  new_val

After the merge, check the first and second rows: for the same ID and score there are two values, asdf and new_val, hence the error:
df = pd.merge(df2, df)
print (df)
      ID  score text_org variable
0   2133    7.0     asdf       t0
1   2133    7.0  new_val       t0
2   2500    7.0     cccc       t0
3   3298    8.0     ytyt       t1
4   2133    3.0     qwer       t1
5   1234   10.0     pois       t0
6   7352    9.0     ijsd       t0
7   7352    7.0     bdcs       t1
8   3298   10.0     swed       t0
9   1234    8.0     zzzz       t1
10  2500    6.0     erer       t1
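
To see which rows of df2 are responsible, the repeated (ID, score) pairs can be listed before merging. A minimal sketch, assuming df2 is rebuilt from the table above; the names df2_dup, mask and dupes are illustrative only.

```python
import pandas as pd

df2_dup = pd.DataFrame({
    "ID": [2133, 2500, 3298, 2133, 1234, 7352, 7352, 3298, 1234, 2500, 2133],
    "score": [7.0, 7.0, 8.0, 3.0, 10.0, 9.0, 7.0, 10.0, 8.0, 6.0, 7.0],
    "text_org": ["asdf", "cccc", "ytyt", "qwer", "pois", "ijsd",
                 "bdcs", "swed", "zzzz", "erer", "new_val"],
})

# keep=False marks every member of a duplicated (ID, score) group,
# not only the later occurrences.
mask = df2_dup.duplicated(subset=["ID", "score"], keep=False)
dupes = df2_dup[mask]
print(dupes)
```

With this data the two rows carrying asdf and new_val are the ones selected.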

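Comment: @jezrael - I got the error "ValueError: Index contains duplicate entries, cannot reshape". Any suggestions?

Reply: Yes, the ID and score columns in df2 contain duplicates.

The solution is to use an aggregate function, or to remove the duplicates from df2: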
#aggregate function is first
df3 = df.pivot_table(index='ID', columns='variable', aggfunc='first')
df3.columns = df1.columns[1:]
print (df3)
      t0 t1 text0 text1
ID                     
1234  10  8  pois  zzzz
2133   7  3  asdf  qwer
2500   7  6  cccc  erer
3298  10  8  swed  ytyt
7352   9  7  ijsd  bdcs

#aggregate function is last
df4 = df.pivot_table(index='ID', columns='variable', aggfunc='last')
df4.columns = df1.columns[1:]
print (df4)
      t0 t1    text0 text1
ID                        
1234  10  8     pois  zzzz
2133   7  3  new_val  qwer
2500   7  6     cccc  erer
3298  10  8     swed  ytyt
7352   9  7     ijsd  bdcs
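
The other option mentioned above, removing the duplicates from df2 before merging, could be sketched with DataFrame.drop_duplicates. This is an illustration on three of the rows for ID 2133, not the answer's own code; keep_first and keep_last are hypothetical names.

```python
import pandas as pd

# Rows for ID 2133 only, including the duplicated (2133, 7.0) pair.
df2_dup = pd.DataFrame({
    "ID": [2133, 2133, 2133],
    "score": [7.0, 3.0, 7.0],
    "text_org": ["asdf", "qwer", "new_val"],
})

# Keep the first text seen for each (ID, score) pair ...
keep_first = df2_dup.drop_duplicates(subset=["ID", "score"], keep="first")
# ... or the last one.
keep_last = df2_dup.drop_duplicates(subset=["ID", "score"], keep="last")

print(keep_first)
print(keep_last)
```

Choosing keep="first" or keep="last" here plays the same role as aggfunc='first' or aggfunc='last' in pivot_table: it decides whether asdf or new_val survives for the pair (2133, 7.0).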