Python: Remove duplicates from a DataFrame when the duplicate values are in different columns of the next row
I have a large dataframe in the following format:
term_x Intersections term_y
boxers 1 briefs
briefs 1 boxers
babies 6 costumes
costumes 6 babies
babies 12 clothes
clothes 12 babies
babies 1 clothings
clothings 1 babies
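The file has millions of rows. What I want to do is get rid of these redundant rows. Is there any way to use the pandas deduplication functionality to eliminate these duplicates quickly? My current approach iterates over the entire dataframe, compares each row with its neighbour, and drops the duplicate row, but this has proven to be very slow:
row_iterator = duplicate_df_selfmerge.iterrows()
_, prev = next(row_iterator)  # take the first row from row_iterator
for index, row in row_iterator:
    # compare each row with the one before it ('Keyword' is not shown in the sample above)
    if (row['term_x'] == prev['term_y']) and (row['term_y'] == prev['term_x']) and (row['Keyword'] == prev['Keyword']):
        duplicate_df_selfmerge.drop(index, inplace=True)
    prev = row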
You can put the two columns together, sort each pair, and then drop duplicate rows based on the sorted pairs:
df['together'] = [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]
df.drop_duplicates(subset=['together'])
Out[11]:
term_x Intersections term_y together
0 boxers 1 briefs boxers,briefs
2 babies 6 costumes babies,costumes
4 babies 12 clothes babies,clothes
6 babies 1 clothings babies,clothings
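For reference, the same idea can be wrapped so that the helper column does not stay in the result. This is only a minimal sketch: the function name `drop_mirrored_pairs` and its `extra` parameter are hypothetical and not part of the answer. `extra` lets further columns (for example `Intersections`) take part in the comparison, similar to what the loop in the question does with its additional column.

```python
import pandas as pd


def drop_mirrored_pairs(df, a="term_x", b="term_y", extra=()):
    """Keep the first row of each unordered (a, b) pair.

    `extra` names further columns that must also match for two rows
    to count as duplicates (hypothetical parameter, for illustration).
    """
    # Same key as in the answer: the two terms, sorted and joined.
    key = pd.Series(
        [",".join(pair) for pair in map(sorted, zip(df[a], df[b]))],
        index=df.index,
    )
    # Put the key next to any extra columns and flag repeated combinations.
    keys = pd.concat([key.rename("_pair"), df[list(extra)]], axis=1)
    return df[~keys.duplicated()]


df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes"],
    "Intersections": [1, 1, 6, 6],
    "term_y": ["briefs", "boxers", "costumes", "babies"],
})
result = drop_mirrored_pairs(df, extra=["Intersections"])
print(result)
```

One caveat of a joined string key is that terms which themselves contain a comma could collide; if that can happen in your data, a separator that never occurs in the terms would be a safer choice.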
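Edit: You said that time is an important factor here. Below are timings comparing my solution and Allen's on a 200,000-row dataframe:
while df.shape[0] < 200000:
    df = pd.concat([df, df], ignore_index=True)  # double the frame until it is large enough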
%timeit df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1)
1 loop, best of 3: 6.62 s per loop
%timeit [','.join(x) for x in map(sorted, zip(df['term_x'], df['term_y']))]
10 loops, best of 3: 121 ms per loop
As you can see, my approach is more than 98% faster (121 ms versus 6.62 s).
pandas.DataFrame.apply is slow in many cases.
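If you want to repeat this comparison outside IPython, something along these lines may work. It is a rough sketch that uses the standard `timeit` module instead of `%timeit`; the helper names `key_apply` and `key_listcomp`, the four-row sample frame, and the 20,000-row size are assumptions chosen to keep the run short, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes"],
    "term_y": ["briefs", "boxers", "costumes", "babies"],
})
# 4 rows repeated 5,000 times gives 20,000 rows; raise n_repeats for a larger test.
n_repeats = 5_000
df = pd.concat([base] * n_repeats, ignore_index=True)


def key_apply(frame):
    # Row-wise apply, as in the slower variant.
    return frame.apply(lambda x: str(sorted([x.term_x, x.term_y])), axis=1)


def key_listcomp(frame):
    # Plain Python loop over the zipped columns, as in the faster variant.
    return [",".join(x) for x in map(sorted, zip(frame["term_x"], frame["term_y"]))]


t_apply = timeit.timeit(lambda: key_apply(df), number=1)
t_listcomp = timeit.timeit(lambda: key_listcomp(df), number=1)
print(f"apply:     {t_apply:.3f} s")
print(f"list comp: {t_listcomp:.3f} s")
```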
import pandas as pd

df = pd.DataFrame({'Intersections': {0: 1, 1: 1, 2: 6, 3: 6, 4: 12, 5: 12, 6: 1, 7: 1},
'term_x': {0: 'boxers',1: 'briefs',2: 'babies',3: 'costumes',4: 'babies',
5: 'clothes',6: 'babies',7: 'clothings'}, 'term_y': {0: 'briefs',1: 'boxers',
2: 'costumes',3: 'babies',4: 'clothes',5: 'babies',6: 'clothings',7: 'babies'}})
# Create a column that combines term_x and term_y in sorted order.
df['team_xy'] = df.apply(lambda x: str(sorted([x.term_x,x.term_y])), axis=1)
# Drop duplicates on the combined field.
df.drop_duplicates(subset='team_xy',inplace=True)
df
Out[916]:
Intersections term_x term_y team_xy
0 1 boxers briefs ['boxers', 'briefs']
2 6 babies costumes ['babies', 'costumes']
4 12 babies clothes ['babies', 'clothes']
6 1 babies clothings ['babies', 'clothings']
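As a possible alternative that avoids both `apply` and a string key, the pair can be sorted inside each row with NumPy and the duplicate mask taken from that. Note that this is a different technique from either answer, shown only as a sketch; the names `pairs`, `mask`, and `deduped` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "term_x": ["boxers", "briefs", "babies", "costumes"],
    "Intersections": [1, 1, 6, 6],
    "term_y": ["briefs", "boxers", "costumes", "babies"],
})

# Sort each (term_x, term_y) pair within its row so mirrored rows become identical.
pairs = np.sort(df[["term_x", "term_y"]].astype(str).to_numpy(), axis=1)
# Flag every row whose sorted pair has already appeared earlier.
mask = pd.DataFrame(pairs, index=df.index).duplicated()
deduped = df[~mask]
print(deduped)
```

Because the mask is built separately, the original frame keeps its columns unchanged and no helper column has to be dropped afterwards.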
How do you define "duplicate"? What output do you expect for your example? Your example also has no Keyword column.