R: find matches between two data frames and write the result to a data frame
Tags: r, fuzzy-logic, fuzzy-comparison
I have two data frames that were cleaned and merged into a single csv file. The data frames look like this:
| Source | Master |
| --- | --- |
| chang chun petrochemical | CHANG CHUN GROUP |
| chang chun plastics | CHURCH AND DWIGHT CO INC |
| church dwight | CITRIX SYSTEMS ASIA PACIFIC P L |
| citrix systems pacific | CNH INDUSTRIAL N.V |
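Now I have to take each name in Source, check it against the names in Master, find the matching entry, and write the output to another data frame. The sample above is small, but my real data has about 20k values.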
My output must look like this:

| Source | Master | Result |
| --- | --- | --- |
| chang chun petrochemical | CHANG CHUN GROUP | CHANG CHUN GROUP |
| chang chun plastics | CHURCH AND DWIGHT CO INC | CHANG CHUN GROUP |
| church dwight | CITRIX SYSTEMS ASIA PACIFIC P L | CHURCH AND DWIGHT CO INC |
| citrix systems pacific | CNH INDUSTRIAL N.V | CITRIX SYSTEMS ASIA PACIFIC P L |
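I tried a possible approach from this link, but no luck so far.
Thanks in advance.
When I use the above code on a large amount of data, the result is as follows: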
Code used:
Mast <- pmatch(Names$I_sender_O_Receiver_Customer, Master.Names$MOD, nomatch=NA)
Mast <- sapply(Names$I_sender_O_Receiver_Customer, function(x) {
  agrep(x, Master.Names$MOD, value=TRUE) })
Output:
[[1]]
character(0)
[[2]]
character(0)
[[3]]
[1] " CHURCH AND DWIGHT CO INC"
[[4]]
[1] " CITRIX SYSTEMS ASIA PACIFIC P L"
[[5]]
character(0)
Even with a for loop, I get no results.
Code:
for(i in seq_len(nrow(df$ICIS_Cust_Names)))
{
  df$reslt[i] <- grep(x = str_split(df$ICIS_Cust_Names[i]," ")[[1]][1], df$Master_Names[i],value=TRUE)
}
print(df$reslt)
Result:
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Error:
Error in `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, " church dwight " :
replacement has 3 rows, data has 100
If you want to check Master.Names against only the first word of each entry in Names, the following does it:
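Names$Mast <- NA_character_
for (i in seq_len(nrow(Names))) {
  # Upper-case the first word of the name; as.character() guards against factor columns
  first_word <- toupper(strsplit(as.character(Names[i, 1]), " ")[[1]][1])
  Names$Mast[i] <- grep(first_word, Master.Names$V1, value = TRUE)[1]  # [1] gives NA when nothing matches
}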
Data:
Master.Names <- read.csv(text="CHANG CHUN GROUP
CHURCH AND DWIGHT CO INC
CITRIX SYSTEMS ASIA PACIFIC P L
CNH INDUSTRIAL N.V", header=FALSE)
Names <- read.csv(text="chang chun petrochemical
chang chun plastics
church dwight
citrix systems pacific", header=FALSE)
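The question asks for the matches as a separate data frame, so here is a minimal sketch of how the first-word approach could be wrapped to build one. `match_first_word` is a hypothetical helper name, not part of any package, and the sample data is rebuilt inline as character vectors so the block runs on its own. It assumes that taking the first hit is acceptable when several master names contain the word.

```r
# Sample data from the post, kept as character columns
Master.Names <- data.frame(V1 = c("CHANG CHUN GROUP", "CHURCH AND DWIGHT CO INC",
                                  "CITRIX SYSTEMS ASIA PACIFIC P L", "CNH INDUSTRIAL N.V"),
                           stringsAsFactors = FALSE)
Names <- data.frame(V1 = c("chang chun petrochemical", "chang chun plastics",
                           "church dwight", "citrix systems pacific"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: for each source name, return the first master name
# that contains the source name's first word (case-insensitive), or NA.
match_first_word <- function(source, master) {
  source <- as.character(source)   # also handles factor input
  master <- as.character(master)
  vapply(source, function(nm) {
    first_word <- toupper(strsplit(trimws(nm), " ", fixed = TRUE)[[1]][1])
    if (is.na(first_word)) return(NA_character_)
    hits <- grep(first_word, toupper(master), fixed = TRUE)
    if (length(hits) == 0) NA_character_ else master[hits[1]]
  }, character(1), USE.NAMES = FALSE)
}

result <- data.frame(Source = Names$V1,
                     Result = match_first_word(Names$V1, Master.Names$V1),
                     stringsAsFactors = FALSE)
print(result)
```

Because the lookup is a plain substring search, a first word such as "CHANG" would also hit a master name like "EXCHANGE ...", so the results on a 20k-row set are worth spot-checking.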
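Could the problem be with Master.Names$V1? If so, try Master.Names[,1] instead.

It throws an error on toupper(x = strsplit(Names[i,1], " ")[[1]][1]). Can I use amatch from stringdist to match the names above? Can I pass the data frame values so that amatch does not throw an error?

My answer assumes your data frames are named Names and Master.Names ... is that the case? Or your variables may be factors, in which case you need to convert them to strings with as.character().

I named them cust.Names and Master.Names only for the sample data. I am actually working with huge data frames, and the logic does not work in that case.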