Python: Key column has duplicate values. I'm trying to merge DataFrames

Tags: python, pandas, join, merge

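I want to merge DF1 with DF2 on customerEmail, which is common to both but contains duplicate values.

Note that DF1 and DF2 are for illustration purposes only.

customerEmail is not unique in either DataFrame.

So when I join the two tables with pd.merge(DF1, DF2, on='customerEmail', how='left'), whenever the same customerEmail is repeated, my target column Fraud is filled with a seemingly random value taken from a row above.

When customerEmail contains duplicates, I want my Fraud column to have null values for them.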
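To illustrate why the merge behaves this way, here is a minimal sketch with made-up frames (the names `left`, `right` and their values are assumptions, not the asker's data). When a key is repeated on both sides, pandas pairs every matching left row with every matching right row, so the result has more rows than the left frame and values appear to be "copied" from other rows.

```python
import pandas as pd

# Hypothetical toy frames: the key "b" appears twice on each side.
left = pd.DataFrame({"customerEmail": ["a", "b", "b"],
                     "Fraud": [False, True, False]})
right = pd.DataFrame({"customerEmail": ["a", "b", "b"],
                      "ID": [0, 1, 2]})

# Each "b" row on the left is matched with each "b" row on the right (2 x 2 = 4 rows).
merged = pd.merge(left, right, on="customerEmail", how="left")
print(len(left), len(merged))  # 3 5

# validate= asks pandas to check key uniqueness and raise if the check fails.
try:
    pd.merge(left, right, on="customerEmail", how="left", validate="many_to_one")
    key_is_unique = True
except pd.errors.MergeError:
    key_is_unique = False
print(key_is_unique)  # False
```

Passing `validate` is an easy way to detect the situation early instead of discovering the extra rows afterwards.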

Current output:

customer_Email ID     Fraud
 name_0        0      False
 name_1        1      True
 name_2        2      True
 name_3        3      True   
 name_4        4      False
 name_1        5      True
 name_2        0      True
 name_1        1      True
 name_3        2      True

Expected output:

customer_Email ID     Fraud
 name_0        0      False
 name_1        1      True
 name_2        2      True
 name_3        3      True   
 name_4        4      False
 name_1        5      N/A
 name_2        0      N/A
 name_1        1      N/A
 name_3        2      N/A
How about the following? (Assuming customer_Email is unique in df2):

Output:

    customer_Email  Fraud   ID
0   name_0          False   0
1   name_1          True    1
2   name_1          N/A     5
3   name_2          True    2
4   name_3          True    3
5   name_4          False   4
6   name_1          N/A     1  
7   name_1          N/A     5
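"When customerEmail contains duplicates, I want my Fraud column to have null values."

So in the expected output you forgot to include name_4 in customerEmail, because it is duplicated as well.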

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'customerEmail': ['name0', 'name1', 'name2', 'name3', 'name4', 'name1'],
    'Fraud': [False, True, True, True, False, False]
})
df2 = pd.DataFrame({
    'customerEmail': ['name0', 'name1', 'name2', 'name3', 'name4', 'name1'],
    'ID': [0, 1, 2, 3, 4, 5]
})

df3 = pd.merge(df1, df2, on='customerEmail', how='left')

# pick the rows whose Fraud value will be blanked: the last row for each customerEmail
# (.copy() so that the assignment below does not touch df3)
df_duplicates = df3.drop_duplicates(subset=['customerEmail'], keep='last').copy()
print(df_duplicates)
  customerEmail  Fraud  ID
0         name0  False   0
3         name2   True   2
4         name3   True   3
5         name4  False   4
7         name1  False   5
# now fill the Fraud column of those rows with np.nan
df_duplicates['Fraud'] = np.nan
print(df_duplicates)
  customerEmail  Fraud  ID
0         name0    NaN   0
3         name2    NaN   2
4         name3    NaN   3
5         name4    NaN   4
7         name1    NaN   5
# so now you have two frames: df_duplicates with the NaN values above,
# and the main df3 with the original merged values

# now stack the two frames with concat (rows of df_duplicates are appended below df3),
# then use drop_duplicates to remove rows that repeat the same customerEmail and Fraud pair
result = pd.concat([df3, df_duplicates]).drop_duplicates(subset=['customerEmail', 'Fraud'])
# after concat the Fraud column mixes True/False with NaN and is no longer a plain bool column,
# so cast it to the nullable 'boolean' dtype to get True/False/<NA>
result.Fraud = result.Fraud.astype('boolean')
print(result)
  customerEmail  Fraud  ID
0         name0  False   0
1         name1   True   1
3         name2   True   2
4         name3   True   3
5         name4  False   4
6         name1  False   1
0         name0   <NA>   0
3         name2   <NA>   2
4         name3   <NA>   3
5         name4   <NA>   4
7         name1   <NA>   5
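The final cast in the answer above relies on pandas' nullable 'boolean' dtype. As a small standalone sketch (the series `s` is a made-up example, not data from the question), this is how a column that mixes True/False with NaN behaves before and after the cast:

```python
import numpy as np
import pandas as pd

# A made-up column mixing booleans with a missing value.
s = pd.Series([True, False, np.nan])
print(s.dtype)  # object: a plain bool column cannot hold NaN

# The nullable 'boolean' dtype keeps True/False and shows the gap as <NA>.
b = s.astype("boolean")
print(b.dtype)  # boolean
print(b.isna().tolist())  # [False, False, True]
```

This is why the cast is done last: the missing values must already be in the column, and only then can it be turned back into a True/False/<NA> column.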
You can use the keep=False option to get the emails that are duplicated in df1 and/or df2.
Below is code that sets N/A for every row whose email is duplicated in df1 or df2:

df = pd.merge(DF1, DF2, on='customerEmail', how='left')
dups_1 = set(DF1.customerEmail[DF1.customerEmail.duplicated(keep=False)])  # duplicated emails in DF1
dups_2 = set(DF2.customerEmail[DF2.customerEmail.duplicated(keep=False)])  # duplicated emails in DF2
dups = dups_1.union(dups_2)  # emails duplicated in DF1 or DF2 (you can also use only dups_1 or dups_2)
df["Fraud"] = df.apply(lambda row: "N/A" if row.customerEmail in dups else row.Fraud, axis=1)  # "N/A" if the email is a duplicate
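For readers unsure what keep=False changes, here is a short sketch on a made-up series (`emails` is an assumption, not the asker's data). By default `duplicated` flags only the repeats after the first occurrence, while keep=False flags every member of a duplicated group, which is what the answer above needs.

```python
import pandas as pd

# Made-up emails: "name1" occurs twice.
emails = pd.Series(["name0", "name1", "name2", "name1"])

# Default (keep='first'): only the later occurrence is flagged.
first_kept = emails.duplicated()
print(first_kept.tolist())  # [False, False, False, True]

# keep=False: every occurrence of a repeated value is flagged.
all_marked = emails.duplicated(keep=False)
print(all_marked.tolist())  # [False, True, False, True]

# The set of emails that occur more than once.
dups = set(emails[all_marked])
print(dups)  # {'name1'}
```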

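Comments:

Welcome to Stack! Could you post your dataframes as inline code, so people can copy them directly? You can also read up on how to create a reproducible example.

What dataframe does your code currently return? Could you be more specific and give us some sample data with the desired output, instead of screenshots?

Maybe you could left join DF1 onto DF2 with pd.merge(DF2, DF1, on='customerEmail', how='left'), but that won't give you "null values" when there are duplicate customerEmail entries.

Welcome. First, please edit your question and remove the links to external sites. Post the tables as code on the site instead of linking to external images that may disappear someday, which would leave users unable to understand the question. That said, please post an expected output. What do you want the resulting table to look like? There is a reason why merges do not work well on data with duplicate entries in the merge column.

Please show the expected output, and post the data as text rather than images... If DF1 had Fraud==False in the first occurrence of name_1 and True in the second, would you want the first (False) to appear in the merged df? Or True if any of the rows in df1 is True?

Thanks for trying.

Thank you. You only need to replace the Fraud values that are not in the first row with "N/A".

Thanks.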
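The request in the comments (keep Fraud for the first occurrence of an email, blank it for later ones) can be sketched as follows. This is only one possible reading of the thread, and the frame `merged` below is typed in by hand from the "Current output" table in the question rather than produced by an actual merge.

```python
import pandas as pd

# Hand-typed copy of the "Current output" table from the question.
merged = pd.DataFrame({
    "customer_Email": ["name_0", "name_1", "name_2", "name_3", "name_4",
                       "name_1", "name_2", "name_1", "name_3"],
    "ID": [0, 1, 2, 3, 4, 5, 0, 1, 2],
    "Fraud": [False, True, True, True, False, True, True, True, True],
})

# True for every row whose email has already appeared further up.
is_repeat = merged["customer_Email"].duplicated(keep="first")

# Cast to the nullable 'boolean' dtype, then blank out the repeated rows.
merged["Fraud"] = merged["Fraud"].astype("boolean").mask(is_repeat)
print(merged)
```

With this data the first five rows keep their Fraud value and the last four become <NA>, matching the "Expected output" table. Whether "first occurrence" is the right rule for real data depends on row order, so sorting beforehand may be needed.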
import pandas as pd

df1 = pd.read_csv('1.csv')
df2 = pd.read_csv('2.csv')

out = pd.merge(df1, df2, on='customer_Email', how='left')

# emails that occur exactly once in df2 (keep=False drops every member of a duplicated group)
unique_emails = df2.drop_duplicates(subset='customer_Email', keep=False)['customer_Email']

# set Fraud to missing for every row whose email is not unique in df2
out.loc[~out['customer_Email'].isin(unique_emails), 'Fraud'] = None
out

This gives:

    customer_Email  Fraud   ID
0   name_0          0.0     0
1   name_1          NaN     1
2   name_1          NaN     5
3   name_2          1.0     2
4   name_3          1.0     3
5   name_4          0.0     4
6   name_1          NaN     1
7   name_1          NaN     5