Python/R: Remove duplicate rows, keeping unique author pairs
This is a sample I extracted from my database. I am building a co-authorship visualization, so based on this sample I need to keep exactly one relationship between any two authors. For example, I have to delete either Brian Norton --- Maria Roo Ons or Maria Roo Ons --- Brian Norton so that each relationship stays unique.
-------------------------------------------------------------------------------------------------
| article_title | author_name | coauthor_name |
-------------------------------------------------------------------------------------------------
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons | Brian Norton
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann | Brian Norton
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu | Brian Norton
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Sarah McCormack | Brian Norton
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Sarah McCormack | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Sarah McCormack | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Sarah McCormack | S. Shynu
-------------------------------------------------------------------------------------------------
The desired final output is as follows:
-------------------------------------------------------------------------------------------------
| article_title | author_name | coauthor_name |
-------------------------------------------------------------------------------------------------
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton | Maria Roo Ons
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Brian Norton | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons | Max Ammann
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Maria Roo Ons | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann | S. Shynu
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | Max Ammann | Sarah McCormack
A Metal Plate Solar Antenna for UMTS Pico-cell Base Station | S. Shynu | Sarah McCormack
In this case I want to keep only one row per pair. How can I handle this in R or Python?
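Thank you very much for your help.

I assume you have a separate database and are connecting to it from Python. Possible approaches:
1) You can add a row number based on the article column and then deduplicate. There is an existing answer showing how to implement this in SQL. You can then run the query through a Python DB connector.
2) You can pull the records into a pandas DataFrame and do the analysis there. pandas is good at processing and manipulating data. I assume your data frame looks like the one shown below, since you have not shared what other cases can occur.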
article author1 author2
A a b
A b a
A a a
A b b
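As a rough illustration of approach 1, the row-number deduplication could look like the sketch below. It loads the small sample frame above into an in-memory SQLite database through Python's built-in sqlite3 module. The table name coauthors is hypothetical, and the CASE expressions that put each pair into alphabetical order are my assumption about how to treat a-b and b-a as the same pair. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table holding the four sample rows shown above.
rows = [("A", "a", "b"), ("A", "b", "a"), ("A", "a", "a"), ("A", "b", "b")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coauthors (article TEXT, author1 TEXT, author2 TEXT)")
conn.executemany("INSERT INTO coauthors VALUES (?, ?, ?)", rows)

# Number the rows inside each (article, unordered pair) partition and keep
# only the first one. Self-pairs (author1 = author2) are filtered out first.
query = """
SELECT article, author1, author2
FROM (
    SELECT article, author1, author2,
           ROW_NUMBER() OVER (
               PARTITION BY article,
                            CASE WHEN author1 < author2 THEN author1 ELSE author2 END,
                            CASE WHEN author1 < author2 THEN author2 ELSE author1 END
               ORDER BY author1, author2
           ) AS rn
    FROM coauthors
    WHERE author1 <> author2
) AS ranked
WHERE rn = 1
ORDER BY article, author1, author2
"""
unique_pairs = conn.execute(query).fetchall()
conn.close()
print(unique_pairs)
```

The same query text should carry over to other databases that support window functions, with the connection code swapped for the matching Python connector.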
In R, this is how you can get the rows you are looking for. I assume your data frame is called df1.
# This will create a new dataframe df2 with only those rows where author1 and author2 are different
df2 <- df1[df1$author1 != df1$author2, ]
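Let me know if this is what you need.

What have you tried so far? Are you using any libraries/packages? (NumPy/pandas for Python, dplyr or data.table for R)

Thanks for your answer, but my problem is not removing duplicate columns. The problem is how to remove duplicates when author1 and author2 have the same values within the same article but in a different order.

When you say the same values, do you mean author1 and author2 should both be a?

No, it means that within the same article I only need one record, such as: article A, author1 a, author2 b, or article A, author1 b, author2 a.

See my updated answer. Since you did not give me a larger data set to fully test my code, I assume author1 and author2 will only contain a or b. Let me know if this solves your problem.

Thank you very much for your help. Thanks, I will try it.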
article author1 author2
A a b
A b a
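For the requirement clarified in the comments (one record per author pair within an article, regardless of order), a minimal pandas sketch is shown below. The helper name drop_reversed_pairs and the temporary _first/_second columns are hypothetical. The sketch assumes the names are plain strings with no missing values and that keeping the first occurrence of each pair is acceptable.

```python
import pandas as pd


def drop_reversed_pairs(df, group_col, a_col, b_col):
    """Keep one row per unordered (a, b) pair within each group (hypothetical helper)."""
    # Drop rows where an author is paired with themselves.
    out = df[df[a_col] != df[b_col]].copy()
    # Build an order-independent key: the two names in alphabetical order.
    pairs = list(zip(out[a_col], out[b_col]))
    out["_first"] = [min(p) for p in pairs]
    out["_second"] = [max(p) for p in pairs]
    # Keep the first occurrence of each (group, pair) combination.
    out = out.drop_duplicates(subset=[group_col, "_first", "_second"])
    return out.drop(columns=["_first", "_second"]).reset_index(drop=True)


# Sample frame from the answer above.
df1 = pd.DataFrame({
    "article": ["A", "A", "A", "A"],
    "author1": ["a", "b", "a", "b"],
    "author2": ["b", "a", "a", "b"],
})

df2 = drop_reversed_pairs(df1, "article", "author1", "author2")
print(df2)
```

For the table in the question, the call would presumably be drop_reversed_pairs(df, "article_title", "author_name", "coauthor_name"). Because the first occurrence of each pair is kept, the surviving rows follow the original row order.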