Python pandas: concatenation conditional on unique values


I am concatenating two DataFrames as shown below:

import numpy as np
import pandas as pd

part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.random.randn(5)})

part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})

# Stack part2 below part1; rows with duplicate ids are kept
concatenated = pd.concat([part1, part2], axis=0)
     amount   id
0 -0.458653  100
1  2.172348  200
2  0.072494  300
3 -0.253939  400
4 -0.061866  500
0 -1.187505  700
1 -0.810784  100
2  0.321881  800
3 -1.935284  500
4 -1.351507  300

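How can I restrict the operation so that a row of part2 is included in concatenated only if its id does not already appear in part1? In a sense, I want to treat the id column as a set.

Can this be done during concat(), or is it more of a post-processing step?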

The desired output for this example is:

concatenated_desired
     amount   id
0 -0.458653  100
1  2.172348  200
2  0.072494  300
3 -0.253939  400
4 -0.061866  500
0 -1.187505  700
2  0.321881  800

First, select the rows of part2 whose id is not in part1:

In [28]:
diff = part2.loc[~part2['id'].isin(part1['id'])]
diff

Out[28]:
     amount   id
0 -2.184038  700
2 -0.070749  800

Now concatenate:

In [29]:
concatenated = pd.concat([part1, diff], axis=0)
concatenated

Out[29]:
     amount   id
0 -2.240625  100
1 -0.348184  200
2  0.281050  300
3  0.082460  400
4 -0.045416  500
0 -2.184038  700
2 -0.070749  800

You can also do it in one line:

concatenated = pd.concat([part1, part2.loc[~part2['id'].isin(part1['id'])]], axis=0)
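
The two steps above can be sketched as one self-contained script. This is a minimal sketch, not a definitive implementation: the seeded random generator and the names `already_present` and `new_rows` are assumptions added for illustration and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Seeded generator so the run is reproducible (an assumption; the
# original example uses unseeded np.random.randn).
rng = np.random.default_rng(0)

part1 = pd.DataFrame({"id": [100, 200, 300, 400, 500],
                      "amount": rng.standard_normal(5)})
part2 = pd.DataFrame({"id": [700, 100, 800, 500, 300],
                      "amount": rng.standard_normal(5)})

# Boolean mask: True where an id of part2 already occurs in part1
already_present = part2["id"].isin(part1["id"])

# Keep only the rows with new ids, then stack them below part1
new_rows = part2.loc[~already_present]
concatenated = pd.concat([part1, new_rows], axis=0)

print(concatenated["id"].tolist())  # [100, 200, 300, 400, 500, 700, 800]
```

The original row labels are preserved, so the result carries the index 0, 1, 2, 3, 4, 0, 2. Passing `ignore_index=True` to `pd.concat` would renumber the rows if that is preferred.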

If you have a column of ids, use it as the index. Working with a real index makes things easier. Here, combine_first does what you are looking for:
part1 = part1.set_index('id')

part2 = part2.set_index('id')

part1.combine_first(part2)
Out[38]: 
       amount
id           
100  1.685685
200 -1.895151
300 -0.804097
400  0.119948
500 -0.434062
700  0.215255
800 -0.031562

If you really don't need that index, reset it afterwards:
part1.combine_first(part2).reset_index()
Out[39]: 
    id    amount
0  100  1.685685
1  200 -1.895151
2  300 -0.804097
3  400  0.119948
4  500 -0.434062
5  700  0.215255
6  800 -0.031562
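
The index-based approach can be checked end to end with fixed values. This is a minimal sketch: the amounts below are made-up illustration data, not the random values from the original post, and the name `combined` is hypothetical.

```python
import pandas as pd

# Fixed illustration values (assumed, so the result is easy to verify)
part1 = pd.DataFrame({"id": [100, 200, 300, 400, 500],
                      "amount": [1.0, 2.0, 3.0, 4.0, 5.0]})
part2 = pd.DataFrame({"id": [700, 100, 800, 500, 300],
                      "amount": [70.0, 10.0, 80.0, 50.0, 30.0]})

# Index both frames by id; rows of part1 win wherever an id is shared,
# and ids found only in part2 are added
combined = (
    part1.set_index("id")
         .combine_first(part2.set_index("id"))
         .reset_index()
)

print(combined)
```

Two differences from the isin approach are worth keeping in mind. The result is ordered by the union of the two indexes, so the rows come out sorted by id. Also, combine_first fills missing (NaN) values in part1 from part2 at the same id, so the two methods agree only when part1 has no missing values for shared ids.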

Call drop_duplicates() after concat():
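I looked at the documentation but am still not quite sure. Does this guarantee that the first occurrence (row) of a given id is kept?

Yes, there is a take_last parameter: boolean, default False. It takes the last observed row among the duplicates, and defaults to the first. So you can choose which one to keep, the first or the last. (In current pandas releases this option is spelled keep='first' or keep='last'.)

So take_last=False (the default) means "take the first"?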
part1 = pd.DataFrame({'id'    :[100,200,300,400,500], 
                   'amount': np.arange(5)
                    })

part2 = pd.DataFrame({'id'    :[700,100,800,500,300], 
                   'amount': np.random.randn(5)
                    })

concatenated = pd.concat([part1, part2], axis=0)
# Keep the first row seen for each id (subset= replaces the old cols= argument)
print(concatenated.drop_duplicates(subset="id"))
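
To make the first-versus-last choice discussed above concrete, here is a minimal sketch that uses the `subset` and `keep` parameters of current pandas releases. The amounts are made-up illustration values, and `keep_first` and `keep_last` are hypothetical names.

```python
import pandas as pd

# Fixed illustration values (assumed) so duplicates are easy to tell apart
part1 = pd.DataFrame({"id": [100, 200, 300, 400, 500],
                      "amount": [0.0, 1.0, 2.0, 3.0, 4.0]})
part2 = pd.DataFrame({"id": [700, 100, 800, 500, 300],
                      "amount": [70.0, 10.0, 80.0, 50.0, 30.0]})

concatenated = pd.concat([part1, part2], axis=0)

# keep="first" (the default) keeps the part1 row for every shared id
keep_first = concatenated.drop_duplicates(subset="id", keep="first")

# keep="last" keeps the part2 row for every shared id instead
keep_last = concatenated.drop_duplicates(subset="id", keep="last")

print(keep_first["amount"].tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0, 70.0, 80.0]
print(keep_last["id"].tolist())       # [200, 400, 700, 100, 800, 500, 300]
```

Because part1 comes first in the concatenation, the default `keep="first"` yields the result the question asks for.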