Python: Remove reverse duplicates from a data frame


python pandas dataframe


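I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would consider (0, 50) and (50, 0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a data frame?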

import pandas as pd

# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50], 
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
    A   B
0   0  50
1  10  22
2  11  35
3  21   5
4  22  10
5  35  11
6   5  21
7  50   0

# Desired output with "duplicates" removed. 
data2 = pd.DataFrame({'A': [0, 5, 10, 11], 
                      'B': [50, 21, 22, 35]})
data2
    A   B
0   0  50
1   5  21
2  10  22
3  11  35

Ideally, the output would be sorted by the values of column A.

You can sort each row of the data frame before dropping the duplicates:

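# result_type='broadcast' (pandas >= 0.23) keeps columns A and B; without it, apply returns a Series of lists.
data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates()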

#   A    B
#0  0   50
#1  10  22
#2  11  35
#3  5   21

If you want the result sorted by column A:

data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates().sort_values('A')

#   A    B
#0  0   50
#3  5   21
#1  10  22
#2  11  35
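For reference, here is a self-contained sketch of the row-wise approach that should reproduce the desired frame data2 from the question, including a fresh 0..3 index. The result_type='broadcast' argument and the final reset_index(drop=True) are additions of mine, not part of the answer as posted; the first assumes pandas 0.23 or later.

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

result = (
    # Sort the two values within each row so (50, 0) becomes (0, 50).
    data.apply(lambda r: sorted(r), axis=1, result_type='broadcast')
        # Rows that are now identical are the "reverse duplicates".
        .drop_duplicates()
        # Order by column A, as the question asks.
        .sort_values('A')
        # Assumption: a clean 0..n-1 index is wanted, as in data2.
        .reset_index(drop=True)
)
print(result)
```

Dropping reset_index(drop=True) keeps the original row labels instead, which can be handy for tracing each surviving pair back to its source row.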

Here is an uglier but faster solution:

In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21
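The same idea as a standalone script, with the imports the transcript leaves implicit. It also sketches a variant that is not in the answer: using the sorted frame only as a mask, so that each surviving row keeps its original orientation. The names normalized, deduped and first_seen are mine.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Sort the two values within each row in one vectorized NumPy call.
normalized = pd.DataFrame(np.sort(data.to_numpy(), axis=1),
                          columns=data.columns, index=data.index)

# Option 1 (the answer's approach): keep the normalized, sorted pairs.
deduped = normalized.drop_duplicates()

# Option 2 (an assumption about what may be wanted): keep the first
# occurrence of each pair exactly as it appeared in the original frame.
first_seen = data[~normalized.duplicated()]

print(deduped)
print(first_seen)
```

With option 2, row 3 stays as (21, 5) rather than being rewritten to (5, 21).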

Timing, for an 8K-row DataFrame:

In [50]: big = pd.concat([data] * 10**3, ignore_index=True)

In [51]: big.shape
Out[51]: (8000, 2)

In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop

In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop

In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
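To repeat the comparison outside IPython, a rough sketch with the standard timeit module could look like the following. The helper names rowwise and vectorized are hypothetical, the frame is smaller than the one timed above so the slow variant finishes quickly, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# 800 rows here (the transcript above used 8000).
big = pd.concat([data] * 100, ignore_index=True)

def rowwise(df):
    # Python-level sort of every row; result_type='broadcast' keeps A and B.
    return df.apply(lambda r: sorted(r), axis=1,
                    result_type='broadcast').drop_duplicates()

def vectorized(df):
    # One NumPy sort over the whole underlying array.
    return pd.DataFrame(np.sort(df.to_numpy(), axis=1),
                        columns=df.columns).drop_duplicates()

# Check that both variants agree before comparing their speed.
assert rowwise(big).values.tolist() == vectorized(big).values.tolist()

t_rowwise = timeit.timeit(lambda: rowwise(big), number=3) / 3
t_vectorized = timeit.timeit(lambda: vectorized(big), number=3) / 3
print(f"row-wise apply: {t_rowwise:.4f} s per call")
print(f"np.sort:        {t_vectorized:.4f} s per call")
```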

Now this solution works:

data.set_index(['A','B']).stack().drop_duplicates().unstack().reset_index()

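More columns can be added as needed, e.g.:

data.set_index(['A','B', 'C']).stack().drop_duplicates().unstack().reset_index()

There is no need for the lambda; .apply(sorted, axis=1) will do.

I like this answer! Everything I came up with involved stacking into a data frame. This cleverness eliminates that need, and it is about as good as a vectorized implementation gets. Not ugly :-)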
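For frames that carry extra columns besides A and B, one possible alternative is to build the order-independent key from A and B alone and use it as a mask. This is a different technique from the stack-based one above, sketched under the assumption that two rows count as duplicates whenever their unordered (A, B) pair matches, whatever the other columns hold. Column C and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Column C is hypothetical; it is not part of the question's data.
df = pd.DataFrame({'A': [0, 10, 50, 22],
                   'B': [50, 22, 0, 10],
                   'C': ['w', 'x', 'y', 'z']})

# Order-independent key built from A and B only.
key = pd.DataFrame(np.sort(df[['A', 'B']].to_numpy(), axis=1), index=df.index)

# Keep the first row seen for each unordered pair, with all columns intact.
result = df[~key.duplicated()]
print(result)
```

Passing keep='last' to duplicated() would retain the last row of each pair instead of the first.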