Python: Why doesn't my deduplication code work when I chain the operations together?
I want to select the duplicates in this DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'firstname':['stack','Bar Bar',np.nan,'Bar Bar','john','mary','jim'],
                   'lastname':['jim','Bar','Foo Bar','Bar','con','sullivan','Ryan'],
                   'email':[np.nan,'Bar','Foo Bar','Bar','john@com','mary@com','Jim@com']})
print(df)
  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com
This approach seems to work:
df = df.dropna(subset=['firstname', 'lastname', 'email'])
df = df[df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False)]
print(df)
  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar
However, if I chain the operations, it does not work:
dupes = (df.dropna(subset=['firstname', 'lastname', 'email'])
.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))
df = df[dupes]
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
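Should I stay away from chaining like this and keep it simple? What is going on here?
This is expected. The problem with the second approach is that you filter the original DataFrame with a mask computed from already-filtered data, so the index of the mask differs from the index of the original DataFrame, and that raises the error: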
print(df)
  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com
dupes = (df.dropna(subset=['firstname', 'lastname', 'email'])
.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))
print(dupes)
1     True
3     True
4    False
5    False
6    False
dtype: bool
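To make the mismatch concrete, here is a minimal sketch that lists the index labels missing from the mask and then attempts the failing lookup. The names `cols`, `missing_labels` and `error_name` are illustrative and not part of the original post; the exception is caught broadly because its import path has moved between pandas versions.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
    'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
    'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com'],
})
cols = ['firstname', 'lastname', 'email']  # illustrative helper name

# The mask is computed on the NaN-free copy, so it only carries labels 1, 3, 4, 5, 6.
dupes = df.dropna(subset=cols).duplicated(subset=cols, keep=False)

# Labels present in df but absent from the mask (the rows that dropna removed).
missing_labels = df.index.difference(dupes.index).tolist()
print(missing_labels)  # [0, 2]

# Indexing the original frame with the shorter mask fails.
error_name = None
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas may also warn that the key gets reindexed
    try:
        df[dupes]
    except Exception as exc:  # broad on purpose, see the note above
        error_name = type(exc).__name__
print(error_name)
```

Rows 0 and 2 have no entry in `dupes`, so pandas cannot decide whether to keep them and refuses the lookup.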
In the first example you filter the already-filtered data, so the indexes are the same and it works:
df = df.dropna(subset=['firstname', 'lastname', 'email'])
print(df)
  firstname  lastname     email
1   Bar Bar       Bar       Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com
print(df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))
1     True
3     True
4    False
5    False
6    False
dtype: bool
df = df[df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False)]
print(df)
  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar
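A possible solution is to reindex the boolean mask to the index of the original DataFrame before using it as an indexer.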
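A minimal sketch of the reindex approach, assuming that rows removed by `dropna` should simply count as non-duplicates (hence `fill_value=False`). The names `cols`, `mask` and `result` are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
    'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
    'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com'],
})
cols = ['firstname', 'lastname', 'email']  # illustrative helper name

# Mask built on the NaN-free rows only (labels 1, 3, 4, 5, 6).
dupes = df.dropna(subset=cols).duplicated(subset=cols, keep=False)

# Stretch the mask back to df's full index; the dropped rows become False.
mask = dupes.reindex(df.index, fill_value=False)

result = df[mask]
print(result)
```

Because `mask` now has exactly the same index as `df`, the boolean lookup aligns row by row and only the rows labelled 1 and 3 are returned.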
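In the first example you update the DataFrame by reassigning it. If you print the DataFrame after dropping the NaN rows, you can see that the index has changed: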
df = df.dropna(subset=['firstname', 'lastname', 'email'])
print(df)
  firstname  lastname     email
1   Bar Bar       Bar       Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com
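The problem with the chained version is that the DataFrame itself is never reassigned, so it keeps its original index, while the dupes Series has fewer rows: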
dupes = df.dropna(subset=['firstname', 'lastname', 'email']).duplicated(subset=['firstname', 'lastname', 'email'], keep=False)
print(dupes)
print(df)
1     True
3     True
4    False
5    False
6    False
dtype: bool
  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com
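When you try to select rows from the DataFrame using the dupes Series, you get the error because the indexes do not match.
Thanks, the reindex trick is very handy.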
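On the question of whether chaining has to be avoided: the sketch below shows two ways to keep a single expression while making sure the mask and the frame share an index. Neither comes from the original answers; `cols`, `chained`, `complete` and `direct` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
    'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
    'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com'],
})
cols = ['firstname', 'lastname', 'email']  # illustrative helper name

# Option 1: .loc accepts a callable that receives the frame it is applied to,
# so the mask is built from the dropna() result and shares its index.
chained = (
    df.dropna(subset=cols)
      .loc[lambda d: d.duplicated(subset=cols, keep=False)]
)

# Option 2: build both conditions on the original frame, so each mask
# already carries df's full index and no realignment is needed.
complete = df[cols].notna().all(axis=1)
direct = df[complete & df.duplicated(subset=cols, keep=False)]

print(chained)
print(direct)
```

Both variants select the rows labelled 1 and 3 for this data. The callable form stays closest to the chained style from the question, while the second form leaves the original index untouched throughout.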