Python: how to drop rows from a DataFrame when the order of the values does not matter

I have a DataFrame like this:
source target weight
1 2 5
2 1 5
1 2 5
1 2 7
3 1 6
1 1 6
1 3 6
My goal is to drop the duplicate rows, where the order of the source and target columns does not matter: a row should count as a duplicate even if its source and target values are swapped relative to an earlier row. In this case, the expected result is:
source target weight
1 2 5
1 2 7
3 1 6
1 1 6
Is there a way to do this?

This should be fairly easy:
data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6]]
df = pd.DataFrame(data,columns=['source','target','weight'])
You can use drop_duplicates:
df = df.drop_duplicates(keep=False)
print(df)
which will result in:
source target weight
1 2 1 5
3 3 1 6
4 1 1 6
5 1 3 6
Since you want to treat source/target as unordered, first normalize each pair:
def pair(row):
    # Sort the pair so (source, target) and (target, source) normalize identically.
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row

df = df.apply(pair, axis=1)
Then you can use df.drop_duplicates().
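Putting those two steps together, here is a runnable sketch of that approach, this time using the question's full seven-row data (including the (1, 2, 7) row, which the snippet above omitted):

```python
import pandas as pd

data = [[1, 2, 5], [2, 1, 5], [1, 2, 5], [1, 2, 7],
        [3, 1, 6], [1, 1, 6], [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])

def pair(row):
    # Normalize each pair so the smaller value is always in 'source'.
    row['source'], row['target'] = sorted([row['source'], row['target']])
    return row

result = df.apply(pair, axis=1).drop_duplicates()
print(result)
# Rows kept: indices 0, 3, 4, 5 (weights 5, 7, 6, 6)
```

Note that after normalization the surviving row 4 reads (1, 3, 6) rather than the original (3, 1, 6); if preserving the original column order of the kept rows matters, keep the normalized frame only for the duplicate check.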
Use frozenset and duplicated:
df[~df[['source', 'target']].apply(frozenset, 1).duplicated()]
source target weight
0 1 2 5
3 3 1 6
4 1 1 6
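Be aware that this one-liner keys only on source/target. With the question's full data (which includes a (1, 2, 7) row), the row with the distinct weight 7 is dropped as well. A small sketch demonstrating that caveat:

```python
import pandas as pd

data = [[1, 2, 5], [2, 1, 5], [1, 2, 5], [1, 2, 7],
        [3, 1, 6], [1, 1, 6], [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])

# frozenset is hashable and order-insensitive, so the key built from
# (1, 2) and the key built from (2, 1) compare equal in duplicated().
keys = df[['source', 'target']].apply(frozenset, axis=1)
result = df[~keys.duplicated()]
print(result)
# The (1, 2, 7) row is dropped too, because weight is not part of the key.
```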
If you want to account for the unordered source/target together with the weight:
df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, 1)).duplicated()]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
However, to handle this explicitly with more readable code:
# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')
# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()
df[~mask]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
Comments:

@VenkataGogu, this is not a duplicate of that question. Try it with df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 1, 1], 'c': [1, 3, 2]}) and df = df.drop_duplicates(subset=['a', 'b'], keep=False): all 3 rows remain. The question is specific and the data is clear; the OP wants to drop duplicates where the values may appear in any column of a subset, and which column they appear in does not matter. In fact, I can't think of an elegant way to do this that doesn't get out of hand as the number of columns grows. Good question.

Actually, the value of weight (the third column) does matter. You just updated the expected result; can you explain what changed? Rows 3 and 5 are still duplicates... I even commented under this question explaining why it is not as simple as you claim. Your output doesn't even match the OP's expected output.

I have removed my downvote, but this now comes down to plain Python speed. True: any solution that uses native Python types and operators instead of numpy types will run at ordinary CPython speed. The same applies to piRSquared's solution.