Python: how to drop rows from a DataFrame when the order of the values does not matter

I have a DataFrame like this:
source target weight
1 2 5
2 1 5
1 2 5
1 2 7
3 1 6
1 1 6
1 3 6
My goal is to drop the duplicate rows, where the order of the source and target columns does not matter: a row should count as a duplicate even if its source and target values are swapped relative to an earlier row. In this case, the expected result is:
source target weight
1 2 5
1 2 7
3 1 6
1 1 6
Is there a way to do this?

This should be fairly easy:
data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6]]
df = pd.DataFrame(data,columns=['source','target','weight'])
You can use drop_duplicates:
df = df.drop_duplicates(keep=False)
print(df)
which will result in:
source target weight
1 2 1 5
3 3 1 6
4 1 1 6
5 1 3 6
Since you want to treat source/target as unordered, first normalize each pair:
def pair(row):
    # Sort the pair so (source, target) and (target, source) normalize identically.
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row

df = df.apply(pair, axis=1)
Then you can use df.drop_duplicates().
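Putting those two steps together, here is a runnable sketch of that approach, this time using the question's full seven-row data (including the (1, 2, 7) row, which the snippet above omitted):

```python
import pandas as pd

data = [[1, 2, 5], [2, 1, 5], [1, 2, 5], [1, 2, 7],
        [3, 1, 6], [1, 1, 6], [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])

def pair(row):
    # Normalize each pair so the smaller value is always in 'source'.
    row['source'], row['target'] = sorted([row['source'], row['target']])
    return row

result = df.apply(pair, axis=1).drop_duplicates()
print(result)
# Rows kept: indices 0, 3, 4, 5 (weights 5, 7, 6, 6)
```

Note that after normalization the surviving row 4 reads (1, 3, 6) rather than the original (3, 1, 6); if preserving the original column order of the kept rows matters, keep the normalized frame only for the duplicate check.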
Use frozenset and duplicated:
df[~df[['source', 'target']].apply(frozenset, 1).duplicated()]
source target weight
0 1 2 5
3 3 1 6
4 1 1 6
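Be aware that this one-liner keys only on source/target. With the question's full data (which includes a (1, 2, 7) row), the row with the distinct weight 7 is dropped as well. A small sketch demonstrating that caveat:

```python
import pandas as pd

data = [[1, 2, 5], [2, 1, 5], [1, 2, 5], [1, 2, 7],
        [3, 1, 6], [1, 1, 6], [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])

# frozenset is hashable and order-insensitive, so the key built from
# (1, 2) and the key built from (2, 1) compare equal in duplicated().
keys = df[['source', 'target']].apply(frozenset, axis=1)
result = df[~keys.duplicated()]
print(result)
# The (1, 2, 7) row is dropped too, because weight is not part of the key.
```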
If you want to account for the unordered source/target together with the weight:
df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, 1)).duplicated()]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
However, to handle this explicitly with more readable code:
# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')
# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()
df[~mask]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
Comments:

@VenkataGogu, this is not a duplicate of that question. Try it with df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 1, 1], 'c': [1, 3, 2]}) and df = df.drop_duplicates(subset=['a', 'b'], keep=False): all 3 rows remain. The question is specific and the data is clear; the OP wants to drop duplicates where the values may appear in any column of a subset, and which column they appear in does not matter. In fact, I can't think of an elegant way to do this that doesn't get out of hand as the number of columns grows. Good question.

Actually, the value of weight (the third column) does matter. You just updated the expected result; can you explain what changed? Rows 3 and 5 are still duplicates... I even commented under this question explaining why it is not as simple as you claim. Your output doesn't even match the OP's expected output.

I have removed my downvote, but this now comes down to plain Python speed. True: any solution that uses native Python types and operators instead of numpy types will run at ordinary CPython speed. The same applies to piRSquared's solution.