Python: how to drop duplicates but keep the row if a certain other column is not null
I have many duplicate records, and some of them have a bank account. I want to keep the records that have a bank account.
import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname':['foo Bar','Bar Bar','Foo Bar','jim','john','mary','jim'],
'lastname':['Foo Bar','Bar','Foo Bar','ryan','con','sullivan','Ryan'],
'email':['Foo bar','Bar','Foo Bar','jim@com','john@com','mary@com','Jim@com'],
'bank':[np.nan,'abc','xyz',np.nan,'tge','vbc','dfg']})
df
firstname lastname email bank
0 foo Bar Foo Bar Foo bar NaN
1 Bar Bar Bar Bar abc
2 Foo Bar Foo Bar Foo Bar xyz
3 jim ryan jim@com NaN
4 john con john@com tge
5 mary sullivan mary@com vbc
6 jim Ryan Jim@com dfg
Basically, something like this:
if there are two Tommy Joes:
keep the one with a bank account
I have tried to dedupe with the code below, but it keeps the duplicate that has no bank account:
# get the index of unique values, based on firstname, lastname, email
# convert to lower and remove white space first
uniq_indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
.applymap(lambda s:s.lower() if type(s) == str else s)
.applymap(lambda x: x.replace(" ", "") if type(x)==str else x)
.drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index
# save unique records
dfiban_uniq = df.loc[uniq_indx]
dfiban_uniq
firstname lastname email bank
0 foo Bar Foo Bar Foo bar NaN # should not be here
1 Bar Bar Bar Bar abc
3 jim ryan jim@com NaN # should not be here
4 john con john@com tge
5 mary sullivan mary@com vbc
# I wanted these duplicates to appear in the result:
firstname lastname email bank
2 Foo Bar Foo Bar Foo Bar xyz
6 jim Ryan Jim@com dfg
You can see that indices 0 and 3 were kept, while the versions of those customers that do have bank accounts were deleted. My expected result is the opposite: drop the duplicates that have no bank account.
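To see why this happens, here is a minimal sketch (a hypothetical two-row frame, not the real data) showing that drop_duplicates(keep='first') is purely positional: whichever duplicate appears first in the frame wins, bank account or not.

```python
import pandas as pd

# Hypothetical two-row frame: the NaN-bank duplicate comes first.
df = pd.DataFrame({'name': ['jim', 'jim'],
                   'bank': [None, 'dfg']})

# keep='first' keeps row 0 (bank is NaN) purely because of its position;
# to keep the banked row instead, the rows must be reordered first.
print(df.drop_duplicates(subset=['name'], keep='first'))
```

This is why the answers below all reorder the rows (one way or another) before calling drop_duplicates.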
I thought about sorting by bank account first, but I have so much data that I'm not sure how to "sanity check" it to see whether that works.
Thanks for any help.
There are some similar questions here, but they all seem to have values that can be sorted, such as age. These hashed bank account numbers are very messy.
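Since the hashed bank values themselves have no meaningful sort order, one workaround (a sketch on two toy rows, using a hypothetical helper column _has_bank) is to sort on whether bank is null at all, so any banked row wins regardless of what the hash looks like:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname': ['jim', 'jim'],
                   'lastname': ['ryan', 'Ryan'],
                   'email': ['jim@com', 'Jim@com'],
                   'bank': [np.nan, 'dfg']})

# Sort on null-ness only: notna() is True for rows with a bank account,
# so sorting descending puts banked rows first; the hashed bank values
# are never compared to each other.
deduped = (df.assign(_has_bank=df['bank'].notna())
             .sort_values('_has_bank', ascending=False)
             .drop_duplicates(subset=['firstname'], keep='first')
             .drop(columns='_has_bank'))
print(deduped)
```

For brevity the sketch dedupes on firstname only; the real code would normalize case and whitespace first and use the full ['firstname', 'lastname', 'email'] subset.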
EDIT:
Some results from trying the answers on my real dataset:
@Erfan's method, sorting the values by subset + bank. 58594 records remaining after dedup:
subset = ['firstname', 'lastname']
df[subset] = df[subset].apply(lambda x: x.str.lower())
df[subset] = df[subset].apply(lambda x: x.replace(" ", ""))
df.sort_values(subset + ['bank'], inplace=True)
df.drop_duplicates(subset, inplace=True)
print(df.shape[0])
58594
@Adam.Er8's answer, sorting the values by bank. 59170 records remaining after dedup:
uniq_indx = (df.sort_values(by="bank", na_position='last').dropna(subset=['firstname', 'lastname', 'email'])
             .applymap(lambda s: s.lower() if type(s) == str else s)
             .applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
             .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index
df.loc[uniq_indx].shape[0]
59170
Not sure why there is a difference, but both results are similar.

You should sort the values by the bank column with na_position='last' (that way drop_duplicates(..., keep='first') will keep a value that is not NaN).
Try this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname': ['foo Bar', 'Bar Bar', 'Foo Bar'],
                   'lastname': ['Foo Bar', 'Bar', 'Foo Bar'],
                   'email': ['Foo bar', 'Bar', 'Foo Bar'],
                   'bank': [np.nan, 'abc', 'xyz']})

uniq_indx = (df.sort_values(by="bank", na_position='last').dropna(subset=['firstname', 'lastname', 'email'])
             .applymap(lambda s: s.lower() if type(s) == str else s)
             .applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
             .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index

# save unique records
dfiban_uniq = df.loc[uniq_indx]
print(dfiban_uniq)
Output:
bank email firstname lastname
1 abc Bar Bar Bar Bar
2 xyz Foo Bar Foo Bar Foo Bar
(This is just your original code, with .sort_values(by="bank", na_position='last') added at the beginning of uniq_indx = ...)

Another answer: sort the values before dropping duplicates. This will make sure NaN does not end up on top. You can sort by bank account before dropping duplicates, putting the duplicates with NaN last:
uniq_indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
             .applymap(lambda s: s.lower() if type(s) == str else s)
             .applymap(lambda x: x.replace(" ", "") if type(x) == str else x)
             .sort_values(by='bank')  # here we sort the values by the bank column
             .drop_duplicates(subset=['firstname', 'lastname', 'email'], keep='first')).index
Method 1: str.lower, sort and drop_duplicates. This also works with many columns.
Method 2: groupby, agg, first. Not easily generalizable to many columns.
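Method 2 is only named here; a minimal sketch of what it might look like on two toy rows (the rename('fn')/rename('ln') key names are illustrative, not from the answer):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname': ['jim', 'jim'],
                   'lastname': ['ryan', 'Ryan'],
                   'email': ['jim@com', 'Jim@com'],
                   'bank': [np.nan, 'dfg']})

# Group on the normalized names; GroupBy 'first' skips NaN, so each
# group keeps its first non-null bank value automatically.
keys = [df['firstname'].str.lower().rename('fn'),
        df['lastname'].str.lower().rename('ln')]
result = df.groupby(keys).agg('first').reset_index(drop=True)
print(result)
```

Note the caveat for a wide dataset: 'first' is taken column by column, so values from different duplicate rows can be mixed into one result row, which is presumably why it does not generalize well to 58 columns.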
Comments:
Why not use keep='last' instead of keep='first'? - Because the bank happens to come last in my toy example; I don't know the order of the rows in the real dataset.
@SCool, could you expand your input sample with more records and an expanded expected result (to cover potential edge cases)? - I just added more data. Not sure how many extra records are needed.
Thanks, this is much more elegant than mine, but my real dataset has 58 columns. That's why I used drop_duplicates with the subset=['firstname', 'lastname', 'email'] argument. - Added another method.
@SCool I just edited my original question to post the results from the real dataset. Is there any reason you removed 'email' from the subset? Also, what does sorting by subset + ['bank'] do compared with sorting by ['bank'] alone?
Thanks, I've compared your answer with the other one. I've edited my original question.