How do I remove duplicates from a CSV with the pandas library in Python?


I've been searching around for examples, but I can't get this to work the way I want.

I want to de-duplicate by "Order ID" and extract the duplicates to a separate CSV. The main thing is that I need to be able to change which column I de-duplicate by; in this case it's "Order ID".

Sample dataset:

Desired output:

I tried this:

import pandas as pd

df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate example.csv')

new_df = df[['ID','Fruit','Order ID','Quantity','Price']].drop_duplicates()

new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)

The problem I'm running into is that it doesn't remove any duplicates.

You can do this by creating a new DataFrame, using value_counts(), then merging and filtering:

# value_counts returns a Series; to_frame() turns it into a DataFrame
df_counts = df['OrderID'].value_counts().to_frame()
# rename the column
df_counts.columns = ['order_counts']

# merge the original on column "OrderID" with the counts by their index
df_merged = pd.merge(df, df_counts, left_on='OrderID', right_index=True)

# the duplicates are simply the rows whose count is higher than 1
df_filtered = df_merged[df_merged['order_counts']>1]

# everything else, i.e. the rows that aren't duplicates
df_not_duplicates = df_merged[df_merged['order_counts']==1]
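Applied to a small made-up frame (the values below are hypothetical, not the asker's data), the approach above splits out duplicates like this:

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Fruit': ['apple', 'Banana', 'apple', 'Kiwi'],
    'OrderID': [1111, 2222, 1111, 4444],
})

df_counts = df['OrderID'].value_counts().to_frame()
df_counts.columns = ['order_counts']
df_merged = pd.merge(df, df_counts, left_on='OrderID', right_index=True)

df_filtered = df_merged[df_merged['order_counts'] > 1]
df_not_duplicates = df_merged[df_merged['order_counts'] == 1]

print(sorted(df_filtered['ID']))        # rows 1 and 3 share OrderID 1111
print(sorted(df_not_duplicates['ID']))  # rows 2 and 4 have unique OrderIDs
```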
Edit: drop_duplicates() keeps only unique values; if duplicates are found, all but one are removed. You can control which copy is kept via the keep parameter, which can be 'first' or 'last'.
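A minimal sketch of the keep parameter with made-up rows (pandas also accepts keep=False, which drops every duplicated row):

```python
import pandas as pd

# Hypothetical frame with a repeated Order ID
df = pd.DataFrame({'OrderID': [1111, 1111, 2222],
                   'Quantity': [5, 7, 3]})

first = df.drop_duplicates(subset='OrderID', keep='first')
last = df.drop_duplicates(subset='OrderID', keep='last')
none_kept = df.drop_duplicates(subset='OrderID', keep=False)

print(first['Quantity'].tolist())      # [5, 3] -- the first 1111 row survives
print(last['Quantity'].tolist())       # [7, 3] -- the last 1111 row survives
print(none_kept['Quantity'].tolist())  # [3]    -- every 1111 row is dropped
```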

Edit 2: from your comments, you want to export the results to CSV. Keep in mind that, as above, I split them into two DataFrames:

a) all items with the duplicates removed (df_not_duplicates)

b) only the items that had duplicates, still including the duplicate rows (df_filtered)


If you want to use the drop_duplicates method, the mistake is in your second line of code (you should use pd.DataFrame):

df = pd.read_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicateexample.csv')

# Create dataframe with duplicates
raw_data = {'ID': [1,2,3,4,5], 
            'Fruit': ['apple', 'Banana', 'Orange','Mango', 'Kiwi'], 
            'Order ID': [1111, 2222, 3333, 4444, 5555], 
        'Quantity': [11, 22, 33, 44, 55],
        'Price': [ 2, 3, 5, 7, 5]}

new_df = pd.DataFrame(raw_data, columns = ['ID','Fruit','Order ID','Quantity','Price']).drop_duplicates()

new_df.to_csv('C:/Users/shane/PycharmProjects/PythonTut/deduping/duplicate test.csv', index=False)

Hope this helps.

Comments:

- All the example code I've tried either returns no dataset at all, or returns exactly the same dataset.
- There's not much we can do with a description alone; can you share a real sample? Also, please don't share information as images unless absolutely necessary.
- Hi, the earlier image was a sample dataset; there's no "real" vs "not real", it's the same idea. Can you share the data in a convenient format?
- Hi, thanks. Will this export the duplicated data to a CSV?
- You can export the new data to .csv; I'll update the example. @shaneo, done. I believe you want the "Type 2" export. By the way, the reason your drop_duplicates() wasn't working is that you didn't set the subset parameter to the column you care about, so it tries to find duplicates considering all columns.
- Thanks a lot. I want to pull the duplicates out into a CSV file, similar to de-duplicating by a chosen column in Excel. Hope that isn't confusing.
- Hi, thanks. The problem is it returns the same data, not just the duplicates. I see now that my use of the drop method was wrong. I've been trying to figure this out for a while with different examples, but I'm still new to pandas.
- @shaneo, I understand you're new to Python and pandas. Keep at it. Kudos.
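To illustrate the point from the comments, here is a sketch with hypothetical rows that share an "Order ID" but differ in the other columns:

```python
import pandas as pd

# Hypothetical rows: two share an Order ID but differ elsewhere
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Order ID': [1111, 1111, 2222],
    'Quantity': [5, 7, 3],
})

# Considering all columns, no two rows are identical, so nothing is removed
print(len(df.drop_duplicates()))                   # 3

# Restricting the comparison to 'Order ID' drops the repeated order
print(len(df.drop_duplicates(subset='Order ID')))  # 2
```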
To export the results to CSV (continuing from the value_counts/merge code above):
# Type 1: save all OrderIDs that had duplicates, keeping every duplicate row
df_filtered.to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")

# Type 2: all OrderIDs that had duplicates, but only one line per OrderID
df_filtered.drop_duplicates(subset="OrderID", keep='last').to_csv("path_to_my_csv//filename.csv", sep=",", encoding="utf-8")
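As an aside, pandas' duplicated() method can produce the same two-way split without the value_counts/merge step. This is an alternative sketch with made-up data, not the answerer's code:

```python
import pandas as pd

# Hypothetical data with one duplicated OrderID
df = pd.DataFrame({'ID': [1, 2, 3],
                   'OrderID': [1111, 2222, 1111]})

# keep=False marks every row whose OrderID appears more than once
mask = df.duplicated(subset='OrderID', keep=False)
df_filtered = df[mask]          # all rows involved in a duplicate
df_not_duplicates = df[~mask]   # rows whose OrderID is unique

print(df_filtered['ID'].tolist())        # [1, 3]
print(df_not_duplicates['ID'].tolist())  # [2]
```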