R: find matches between two data frames and write the result back as a data frame


I have two data frames that were cleaned and merged into a single CSV file. The data frames look like this:

  **Source                         Master**

 chang chun petrochemical      CHANG CHUN GROUP
 chang chun plastics           CHURCH AND DWIGHT CO INC
 church  dwight                CITRIX SYSTEMS ASIA PACIFIC P L
 citrix systems  pacific       CNH INDUSTRIAL N.V

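Now I have to take the Source names, check each one against the Master names, find the matching entry, and print the output as another data frame. The data frames above are small samples; the real data has 20k values.

My output must look like this: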

 **Source                         Master                         Result**

 chang chun petrochemical      CHANG CHUN GROUP                 CHANG CHUN GROUP
 chang chun plastics           CHURCH AND DWIGHT CO INC         CHANG CHUN GROUP
 church  dwight                CITRIX SYSTEMS ASIA PACIFIC P L  CHURCH AND DWIGHT CO INC
 citrix systems  pacific       CNH INDUSTRIAL N.V               CITRIX SYSTEMS ASIA PACIFIC P L
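I tried the possible approaches from this link, but no luck so far.

Thanks in advance.

When I use the above code on a large amount of data, the results are as follows: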

Code used:

Mast <- pmatch(Names$I_sender_O_Receiver_Customer, Master.Names$MOD, nomatch=NA)
Mast <- sapply(Names$I_sender_O_Receiver_Customer, function(x) {
   agrep(x, Master.Names$MOD,value=TRUE) })

Result:

[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] " CHURCH AND DWIGHT CO INC"

[[4]]
[1] " CITRIX SYSTEMS ASIA PACIFIC P L"

[[5]]
character(0)

Even with a for loop, it produces no results.

Code:

for(i in seq_len(nrow(df$ICIS_Cust_Names)))
  {
    df$reslt[i] <- grep(x = str_split(df$ICIS_Cust_Names[i]," ")[[1]][1], df$Master_Names[i],value=TRUE)
  }
  print(df$reslt)

Result:

NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Error:

Error in `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, " church  dwight  " : 
  replacement has 3 rows, data has 100

If you want to check `Master.Names` against only the first word of each entry in `Names`, this will do it:

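Names$Mast <- NA_character_
for (i in seq_len(nrow(Names))) {
    # First word of each name, upper-cased; as.character() guards against factor columns
    first_word <- toupper(strsplit(as.character(Names[i, 1]), " ")[[1]][1])
    Names$Mast[i] <- grep(first_word, as.character(Master.Names$V1), value = TRUE)[1]  # first hit, or NA
}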
Data

Master.Names <- read.csv(text="CHANG CHUN GROUP
CHURCH AND DWIGHT CO INC
CITRIX SYSTEMS ASIA PACIFIC P L
CNH INDUSTRIAL N.V", header=FALSE, stringsAsFactors=FALSE)

Names <- read.csv(text="chang chun petrochemical
chang chun plastics
church dwight
citrix systems pacific", header=FALSE, stringsAsFactors=FALSE)
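
The question asks for the matches to end up as a third column next to Source and Master. A minimal sketch of that step, using the same first-word rule on the sample names, could look like the following. The helper `match_first_word` and the column names `Source`, `Master`, and `Result` are illustrative assumptions, not anything from the original data, and the rule only works when the first word of a name is enough to identify its Master entry.

```r
# Sample names from the question, held as plain character vectors
master <- c("CHANG CHUN GROUP", "CHURCH AND DWIGHT CO INC",
            "CITRIX SYSTEMS ASIA PACIFIC P L", "CNH INDUSTRIAL N.V")
source_names <- c("chang chun petrochemical", "chang chun plastics",
                  "church dwight", "citrix systems pacific")

# Hypothetical helper: first Master entry that contains the first word of `name`
match_first_word <- function(name, master) {
  # as.character() and trimws() guard against factor input and stray padding
  first_word <- toupper(strsplit(trimws(as.character(name)), " +")[[1]][1])
  hits <- grep(first_word, master, value = TRUE, fixed = TRUE)
  if (length(hits) > 0) hits[1] else NA_character_
}

# One row per Source name, with the matched Master entry in Result
result <- data.frame(
  Source = source_names,
  Master = master,
  Result = vapply(source_names, match_first_word, character(1),
                  master = master, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

Names whose first word appears in no Master entry get `NA` in `Result` instead of stopping the loop with a length-zero replacement error.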

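Could `Master.Names` be `Master.Names$V1`? If so, try using `Master.Names[,1]` instead.

It throws an error on this `toupper(x = strsplit(Names[i,1], " ")[[1]][1])`. Can I use `amatch` from `stringdist` to match the above? Can I supply data frame values so that `amatch` does not throw an error?

My answer assumes your data frames are named `Names` and `Master.Names`... is that the case? Or your variables may be factors, in which case you need to convert them to strings with `as.character()`.

I named them `cust.Names` and `Master.Names` only for the sample data. I am working with huge data frames, and the logic applied here does not work in that case.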
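
When the first word is not distinctive enough, one alternative is to score each Master entry by how many whole words it shares with the Source name. The sketch below is an assumption-laden illustration in base R, not the `stringdist::amatch` approach asked about above: the helpers `tokens` and `best_token_match` and the `min_shared` threshold are hypothetical names, and the method only helps when matching names share complete words.

```r
# Hypothetical helper: split a name into unique lower-case word tokens
tokens <- function(x) {
  parts <- strsplit(tolower(trimws(as.character(x))), "[^a-z0-9]+")[[1]]
  unique(parts[nzchar(parts)])
}

# Hypothetical helper: Master entry sharing the most words with `name`,
# or NA when fewer than `min_shared` words overlap
best_token_match <- function(name, master, min_shared = 1) {
  name_tokens <- tokens(name)
  shared <- vapply(master,
                   function(m) length(intersect(name_tokens, tokens(m))),
                   integer(1), USE.NAMES = FALSE)
  if (length(shared) == 0 || max(shared) < min_shared) return(NA_character_)
  master[which.max(shared)]  # ties resolve to the earliest Master entry
}

# Sample names from the question, including the double space in "church  dwight"
master <- c("CHANG CHUN GROUP", "CHURCH AND DWIGHT CO INC",
            "CITRIX SYSTEMS ASIA PACIFIC P L", "CNH INDUSTRIAL N.V")
source_names <- c("chang chun petrochemical", "chang chun plastics",
                  "church  dwight", "citrix systems  pacific")

matched <- vapply(source_names, best_token_match, character(1),
                  master = master, USE.NAMES = FALSE)
print(data.frame(Source = source_names, Result = matched,
                 stringsAsFactors = FALSE))
```

This compares every Source name with every Master entry and re-tokenises the Master list on each call, so for 20k names it would likely be worth tokenising the Master names once up front.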