Python: remove duplicates from a DataFrame when the duplicated values are in different columns of the next row

I have a large DataFrame in the following format:

term_x  Intersections   term_y
boxers      1   briefs
briefs      1   boxers
babies      6   costumes
costumes    6   babies
babies     12   clothes
clothes    12   babies
babies      1   clothings
clothings   1   babies
This file has millions of rows. What I want to do is get rid of these redundant rows. Is there any way to use pandas' drop-duplicates functionality to remove them quickly? My current approach iterates over the whole DataFrame, takes the value of the next row and then drops the duplicate row, but this has proven to be very slow:

row_iterator = duplicate_df_selfmerge.iterrows()
_, prev = next(row_iterator)  # take the first row from row_iterator
for index, row in row_iterator:
    # drop the current row if its pair is the reverse of the previous row's pair
    if (row['term_x'] == prev['term_y']) and (row['term_y'] == prev['term_x']) and (row['Keyword'] == prev['Keyword']):
        duplicate_df_selfmerge.drop(index, inplace=True)
    prev = row
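For reference (my sketch, not part of the original question), the same previous-row comparison can be written without a Python-level loop by shifting the two columns. This assumes the redundant row always sits directly below the row it mirrors, as in the sample data, and it leaves out the extra Keyword check because that column is not in the sample:

# Vectorised sketch of the loop above: flag a row as redundant when its
# (term_x, term_y) pair is the reverse of the previous row's pair.
df = duplicate_df_selfmerge  # the DataFrame from the question
is_mirror_of_previous = (
    (df['term_x'] == df['term_y'].shift()) &
    (df['term_y'] == df['term_x'].shift())
)
deduped = df[~is_mirror_of_previous]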

You could join the two columns together, sort each pair, and then drop rows based on those sorted pairs:

df['together'] = [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]

df.drop_duplicates(subset=['together'])
Out[11]: 
   term_x  Intersections     term_y          together
0  boxers              1     briefs     boxers,briefs
2  babies              6   costumes   babies,costumes
4  babies             12    clothes    babies,clothes
6  babies              1  clothings  babies,clothings
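A possible variant of the same idea (my sketch, not part of the answer): sort the two term columns row-wise with NumPy and de-duplicate on the sorted pair directly, which avoids building the joined string column:

import numpy as np
import pandas as pd

# Sketch: sort term_x/term_y within each row, then keep only the first
# occurrence of every unordered pair. Assumes df is the DataFrame above.
sorted_pairs = pd.DataFrame(
    np.sort(df[['term_x', 'term_y']].values, axis=1),
    index=df.index,
)
deduped = df[~sorted_pairs.duplicated()]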
Edit: you said that time is an important factor for this problem. Here are the timings comparing my solution and Allen's on a 200,000-row DataFrame:

while df.shape[0] < 200000:
    df = df.append(df)  # keep doubling the frame until it has at least 200,000 rows

%timeit df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1)
1 loop, best of 3: 6.62 s per loop

%timeit [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]
10 loops, best of 3: 121 ms per loop
As you can see, my approach is more than 98% faster.
pandas.DataFrame.apply is slow in many cases.
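If you want to reproduce the comparison outside IPython (where the %timeit magic is not available), here is a rough sketch with the standard timeit module, assuming df is the enlarged 200,000-row frame; absolute numbers will of course differ from the ones above:

import timeit

# Time each approach a few times and report the average seconds per run.
vectorized = timeit.timeit(
    lambda: [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))],
    number=10) / 10
apply_based = timeit.timeit(
    lambda: df.apply(lambda x: str(sorted([x.term_x, x.term_y])), axis=1),
    number=3) / 3
print('join/sorted: %.3fs   apply: %.3fs' % (vectorized, apply_based))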

Allen's solution, the apply-based one timed above, builds the combined column like this:
df = pd.DataFrame({'Intersections': {0: 1, 1: 1, 2: 6, 3: 6, 4: 12, 5: 12, 6: 1, 7: 1},
 'term_x': {0: 'boxers',1: 'briefs',2: 'babies',3: 'costumes',4: 'babies',
  5: 'clothes',6: 'babies',7: 'clothings'}, 'term_y': {0: 'briefs',1: 'boxers',
  2: 'costumes',3: 'babies',4: 'clothes',5: 'babies',6: 'clothings',7: 'babies'}})

#create a column to combine term_x and term_y in sorted order
df['team_xy'] = df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1)
#drop duplicates on the combined fields.
df.drop_duplicates(subset='team_xy',inplace=True)

df
Out[916]: 
   Intersections  term_x     term_y                  team_xy
0              1  boxers     briefs     ['boxers', 'briefs']
2              6  babies   costumes   ['babies', 'costumes']
4             12  babies    clothes    ['babies', 'clothes']
6              1  babies  clothings  ['babies', 'clothings']
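As a small follow-up (my addition, not part of Allen's answer), the team_xy column only exists to support the de-duplication, so it can be dropped once the duplicates are gone; the same applies to the together column in the first approach:

# Remove the helper column after de-duplicating (sketch).
df = df.drop('team_xy', axis=1)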

How do you define "duplicate"? And what would you expect the output to be for your example? Your example also doesn't have a Keyword column.