Python merge returns an odd length

I have a problem with a relatively simple task.

I have two DataFrames:
df_sample, which I read from a CSV file:

+------+-----------+-------+-----------+
| key  | Full Text | Date  | Publisher |
+------+-----------+-------+-----------+
| abcd | foofoo    | date1 | a         |
| bcde | barbar    | date2 | b         |
| cdef | foobar    | date3 | c         |
+------+-----------+-------+-----------+

len(df_sample) = 20000
df_labels, which I read from an Excel file:

+------+----------+--------+--------+
| key  | relevant | other  | other2 |
+------+----------+--------+--------+
| abcd | yes      | blabla | blabla |
| bcde | no       | blabla | blabla |
| cdef | no       | blabla | blabla |
| defg | yes      | blabla | blabla |
+------+----------+--------+--------+

len(df_labels) = 219000
I want to join the two tables so that each key in the first DataFrame is assigned its relevant value. The desired output looks like this:

+------+-----------+-------+-----------+----------+
| key  | Full Text | Date  | Publisher | relevant |
+------+-----------+-------+-----------+----------+
| abcd | foofoo    | date1 | a         | yes      |
| bcde | barbar    | date2 | b         | no       |
| cdef | foobar    | date3 | c         | no       |
+------+-----------+-------+-----------+----------+

I seem to have managed this, but why does the merged result have 27377 rows instead of 20000 (the length of the original left table)?


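You see more rows because the keys are not unique in both DataFrames; in your example it is the second one that contains repeats. A merge produces one output row for every matching pair of rows, so each repeated key in the second df multiplies the rows for that key. You need to decide whether you want the repeated rows (the current behaviour) or whether you want to drop the duplicate rows in the second df: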

df_labels = df_labels.drop_duplicates(subset='key')

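By default this keeps only the first row of each set of duplicates. If you want the alternative behaviour of keeping the last one, you can pass keep='last'. See the drop_duplicates documentation.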
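
To make the effect concrete, here is a minimal sketch on made-up data (the tiny frames below are hypothetical stand-ins, not the asker's 20000 and 219000 row tables, and the left merge shown is an assumed form of the join). It shows how one repeated key in the right table inflates a left merge, and how keep='first' versus keep='last' changes which label survives:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (not the asker's data).
df_sample = pd.DataFrame({
    "key": ["abcd", "bcde", "cdef"],
    "Full Text": ["foofoo", "barbar", "foobar"],
})
# "abcd" appears twice on purpose to mimic a non-unique key.
df_labels = pd.DataFrame({
    "key": ["abcd", "abcd", "bcde", "cdef", "defg"],
    "relevant": ["yes", "no", "no", "no", "yes"],
})

# A left merge emits one row per matching pair, so the repeated key adds a row.
inflated = df_sample.merge(df_labels[["key", "relevant"]], on="key", how="left")

# Keep a single row per key; keep="first" is the default.
first_kept = df_labels.drop_duplicates(subset="key")
last_kept = df_labels.drop_duplicates(subset="key", keep="last")

merged_first = df_sample.merge(first_kept[["key", "relevant"]], on="key", how="left")
merged_last = df_sample.merge(last_kept[["key", "relevant"]], on="key", how="left")

# Compare the row counts before and after de-duplicating the right table.
print(len(df_sample), len(inflated), len(merged_first))
```

Which duplicate to keep is a data question rather than a pandas one: if the repeated keys carry conflicting labels, as "abcd" does here, dropping rows silently picks one of them.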

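Did you check whether the key column values are unique in the second df? If they are duplicated, you will get duplicated rows. Also, are there NaN values in either key column?

Indeed, there are some duplicates in the second df... Thank you very much for pointing me in the right direction!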
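
The checks suggested in the comment above could be sketched roughly as follows. The frames are again hypothetical, and using validate='many_to_one' is an extra safeguard assumed here, not something the original posters mentioned; it asks merge to raise an error when the right-hand keys are not unique instead of silently returning extra rows:

```python
import pandas as pd

# Hypothetical label table with a repeated key and a missing key.
df_labels = pd.DataFrame({
    "key": ["abcd", "abcd", "bcde", None],
    "relevant": ["yes", "no", "no", "yes"],
})
df_sample = pd.DataFrame({
    "key": ["abcd", "bcde"],
    "Full Text": ["foofoo", "barbar"],
})

# Is every key in the label table unique?
key_is_unique = df_labels["key"].is_unique

# Show every row whose key occurs more than once (keep=False marks all copies).
duplicate_rows = df_labels[df_labels["key"].duplicated(keep=False)]

# Count missing keys in the key column.
missing_keys = int(df_labels["key"].isna().sum())

# Ask merge itself to verify that the right-hand keys are unique.
try:
    df_sample.merge(df_labels, on="key", how="left", validate="many_to_one")
    merge_rejected = False
except pd.errors.MergeError:
    merge_rejected = True

print(key_is_unique, len(duplicate_rows), missing_keys, merge_rejected)
```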