Python: merging DataFrames when the key column has duplicate values
python, pandas, join, merge
I am trying to merge DF1 with DF2 on customerEmail, a column that is common to both frames but contains duplicate values.
Note that DF1 and DF2 are for example purposes only.
customerEmail is not unique in either DataFrame.
So when I join the two tables with pd.merge(DF1, DF2, on='customerEmail', how='left'), whenever the same customerEmail is repeated, my target column Fraud is filled with what looks like a random value from the row above.
When customerEmail contains duplicates, I want my Fraud column to hold null values instead.
Current output:
customer_Email ID Fraud
name_0 0 False
name_1 1 True
name_2 2 True
name_3 3 True
name_4 4 False
name_1 5 True
name_2 0 True
name_1 1 True
name_3 2 True
Expected output:
customer_Email ID Fraud
name_0 0 False
name_1 1 True
name_2 2 True
name_3 3 True
name_4 4 False
name_1 5 N/A
name_2 0 N/A
name_1 1 N/A
name_3 2 N/A
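The repeated rows in the current output come from the merge itself: if a key appears m times on the left and n times on the right, the result contains m x n rows for that key. Below is a minimal sketch with made-up data (these frames are illustrative, not the asker's actual tables). It also uses the `validate` argument of `pd.merge`, which raises an error instead of silently multiplying rows.

```python
import pandas as pd
from pandas.errors import MergeError

# Illustrative data: 'name_1' appears twice in each frame.
df1 = pd.DataFrame({
    'customer_Email': ['name_0', 'name_1', 'name_1'],
    'Fraud': [False, True, False],
})
df2 = pd.DataFrame({
    'customer_Email': ['name_0', 'name_1', 'name_1'],
    'ID': [0, 1, 5],
})

merged = pd.merge(df1, df2, on='customer_Email', how='left')
# 'name_0' matches once (1 row); 'name_1' matches 2 x 2 = 4 rows.
print(len(merged))  # 5

# validate= makes pandas check key uniqueness before merging.
try:
    pd.merge(df1, df2, on='customer_Email', how='left', validate='one_to_one')
    keys_unique = True
except MergeError:
    keys_unique = False
print(keys_unique)  # False
```

So the Fraud values on the extra rows are not random: each one is copied from the matching df1 row.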
How about the following? (This assumes customer_Email is unique in df2.)
Output:
customer_Email Fraud ID
0 name_0 False 0
1 name_1 True 1
2 name_1 N/A 5
3 name_2 True 2
4 name_3 True 3
5 name_4 False 4
6 name_1 N/A 1
7 name_1 N/A 5
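"When customerEmail contains duplicates, I want my Fraud column to have null values."
So in your expected output you forgot to add name_4 under customerEmail, since it is also duplicated.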
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
'customerEmail':['name0','name1','name2','name3','name4','name1'],
'Fraud':[False,True,True,True,False,False]
}
)
df2 = pd.DataFrame({
'customerEmail': ['name0', 'name1', 'name2', 'name3', 'name4', 'name1'],
'ID':[0,1,2,3,4,5]
})
df3=pd.merge(df1, df2, on='customerEmail', how='left')
# keep one row (the last) per customerEmail; these rows will get an empty Fraud value
# .copy() avoids modifying a view of df3 in the assignment below
df_duplicates = df3.drop_duplicates(subset=['customerEmail'],keep='last').copy()
print(df_duplicates)
customerEmail Fraud ID
0 name0 False 0
3 name2 True 2
4 name3 True 3
5 name4 False 4
7 name1 False 5
# now blank out the Fraud column of those rows using .loc and np.nan
df_duplicates.loc[:,'Fraud'] = np.nan
print(df_duplicates)
customerEmail Fraud ID
0 name0 NaN 0
3 name2 NaN 2
4 name3 NaN 3
5 name4 NaN 4
7 name1 NaN 5
# now you have two frames: df_duplicates, with NaN in Fraud (shown above),
# and the main df3 with the original merged values
# stack the two frames with concat (rows are appended, columns stay aligned)
# rows that repeat the same customerEmail/Fraud pair are not needed, so remove them with drop_duplicates
result = pd.concat([df3,df_duplicates]).drop_duplicates(subset=['customerEmail','Fraud'])
# after concat the Fraud column is no longer a plain bool column (it now holds NaN),
# so convert it to the nullable 'boolean' dtype to get True/False/<NA>
result.Fraud = result.Fraud.astype('boolean')
print(result)
customerEmail Fraud ID
0 name0 False 0
1 name1 True 1
3 name2 True 2
4 name3 True 3
5 name4 False 4
6 name1 False 1
0 name0 <NA> 0
3 name2 <NA> 2
4 name3 <NA> 3
5 name4 <NA> 4
7 name1 <NA> 5
You can use duplicated() with keep=False to get the emails that are duplicated in df1 and/or df2. Here is code that sets N/A for any row whose email is duplicated in df1 or df2:
df = pd.merge(DF1, DF2, on='customerEmail', how='left')
dups_1 = set(DF1.customerEmail[DF1.customerEmail.duplicated(keep=False)])  # duplicated emails in DF1
dups_2 = set(DF2.customerEmail[DF2.customerEmail.duplicated(keep=False)])  # duplicated emails in DF2
dups = dups_1.union(dups_2)  # duplicated emails in DF1 or DF2 (you could also use only dups_1 or dups_2)
df['Fraud'] = df.apply(lambda row: 'N/A' if row.customerEmail in dups else row.Fraud, axis=1)  # 'N/A' if the email is duplicated
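The row-wise apply above can also be written without a lambda. The following is only a sketch of the same idea on small made-up frames (the data is an assumption for illustration). It stores a real missing value (`pd.NA`) in a nullable boolean column rather than the string 'N/A', which is a deliberate change from the answer above.

```python
import pandas as pd

# Illustrative data: 'name1' is duplicated in DF1 only.
DF1 = pd.DataFrame({
    'customerEmail': ['name0', 'name1', 'name2', 'name1'],
    'Fraud': [False, True, True, False],
})
DF2 = pd.DataFrame({
    'customerEmail': ['name0', 'name1', 'name2'],
    'ID': [0, 1, 2],
})

df = pd.merge(DF1, DF2, on='customerEmail', how='left')

# Emails that occur more than once in either input frame.
dups = set(DF1.loc[DF1['customerEmail'].duplicated(keep=False), 'customerEmail'])
dups |= set(DF2.loc[DF2['customerEmail'].duplicated(keep=False), 'customerEmail'])

# Boolean mask over the merged rows, computed in one vectorized call.
is_dup = df['customerEmail'].isin(dups)

# The nullable 'boolean' dtype can hold True, False and <NA> together.
df['Fraud'] = df['Fraud'].astype('boolean')
df.loc[is_dup, 'Fraud'] = pd.NA
print(df)
```

Keeping a real missing value means `isna()`, `dropna()` and `fillna()` keep working on the column, which a literal 'N/A' string would not allow.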
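Welcome to Stack! Can you post your DataFrames as inline code so people can copy them directly? You can also read how to create.
What DataFrame does your code currently return? Can you be more specific and give us some sample data with the desired output, rather than screenshots?
Maybe you can left join DF1 onto DF2 with pd.merge(DF2, DF1, on='customerEmail', how='left'), but that will not give you "null values" when there are duplicate customerEmail entries.
Welcome. First, please edit your question and remove the links to external sites. Post the tables as code on this site rather than linking to external images that may disappear someday, leaving users unable to understand the question. That said, please post an expected output. What should the resulting table look like?
There is a reason merge does not work well on data with duplicate entries in the merge column. It would help if you showed the expected output and posted the data as text rather than images. If the first occurrence of name_1 in DF1 had Fraud == False and the second had True, would you want the first (False) occurrence in the merged df? Or True if any of the rows in df1 is True?
Thanks for trying.
Thank you.
You only need to replace the Fraud values that are not in the first row for each email with "N/A".
Thanks.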
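The last suggestion in the comments (keep Fraud on the first row for each email, blank it on later rows) could be sketched as follows. The `merged` frame is hypothetical and stands in for the result of the merge; the approach relies on `Series.duplicated(keep='first')`, which flags every occurrence except the first.

```python
import pandas as pd

# Hypothetical merged result with repeated emails.
merged = pd.DataFrame({
    'customer_Email': ['name_0', 'name_1', 'name_2', 'name_1', 'name_2'],
    'ID': [0, 1, 2, 5, 0],
    'Fraud': [False, True, True, True, True],
})

# duplicated() marks every occurrence except the first as True.
repeat = merged['customer_Email'].duplicated(keep='first')

# Use the nullable boolean dtype so the column can hold <NA>.
merged['Fraud'] = merged['Fraud'].astype('boolean')
merged.loc[repeat, 'Fraud'] = pd.NA
print(merged)
```

This matches the shape of the expected output in the question: the first row per email keeps its Fraud value and every later row for that email is empty.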
import pandas as pd
df1 = pd.read_csv('1.csv')
df2 = pd.read_csv('2.csv')
out = pd.merge(df1, df2, on='customer_Email', how='left')
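# emails that occur exactly once in df2
unique_emails = df2.drop_duplicates(subset='customer_Email', keep=False)['customer_Email']
# clear Fraud on every merged row whose email is not unique in df2
out.loc[~out['customer_Email'].isin(unique_emails), 'Fraud'] = None
out
This gives: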
customer_Email Fraud ID
0 name_0 0.0 0
1 name_1 NaN 1
2 name_1 NaN 5
3 name_2 1.0 2
4 name_3 1.0 3
5 name_4 0.0 4
6 name_1 NaN 1
7 name_1 NaN 5
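In the output above Fraud is displayed as 0.0, 1.0 and NaN rather than False and True. As one of the other answers does, the column can be cast to the nullable 'boolean' dtype. This is a small sketch on a made-up frame shaped like that output (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame: Fraud holds floats with a missing value.
out = pd.DataFrame({
    'customer_Email': ['name_0', 'name_1', 'name_2'],
    'Fraud': [0.0, np.nan, 1.0],
    'ID': [0, 1, 2],
})

# 0.0 -> False, 1.0 -> True, NaN -> <NA>
out['Fraud'] = out['Fraud'].astype('boolean')
print(out)
```

The cast only succeeds when every non-missing value is 0 or 1, so it also works as a light sanity check on the column.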