Matching two or more words across different data frames in R

r dataframe


I have two data frames, DF1 and DF2, like this:

ID = c(1, 2, 3, 4) 
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
DF1 = data.frame(ID, Issues, Location, Customer)

Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')  
DF2 = data.frame(Root_Cause, List_of_Issues)

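I want to compare the Issues column of DF1 with the List_of_Issues column of DF2. If a row of Issues shares two or more words with a row of List_of_Issues, I want to fill in the corresponding Root_Cause from DF2. The resulting data frame should look like DF3: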

ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
Root_Cause = c('R2', 'R4', NA, 'R1')
DF3 = data.frame(ID, Issues, Location, Customer, Root_Cause)

Using data.table:

Edit: I have edited your sample data to illustrate the possibility of multiple root causes. In this data, ID == 1 corresponds to R2 and R3.

Data

library(data.table)

ID = c(1, 2, 3, 4)
Issues = c('Issue1, Issue4, Issue6, Issue7', 'Issue2, Issue5, Issue6', 'Issue3, Issue4', 'Issue1, Issue5')
Location = c('x', 'y', 'z', 'w')
Customer = c('a', 'b', 'c', 'd')
DF1 = data.table(ID, Issues, Location, Customer)

Root_Cause = c('R1', 'R2', 'R3', 'R4')
List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6')  
DF2 = data.table(Root_Cause, List_of_Issues)

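Code

# Split each comma-separated string into a character vector (list column)
DF1[, Issues := strsplit(Issues, split = ', ')]
DF2[, List_of_Issues := strsplit(List_of_Issues, split = ', ')]

DF1[, RootCause := lapply(Issues, function(x){

  # Number of issues each DF2 row shares with the current DF1 row
  matchvec = sapply(DF2[, List_of_Issues], function(y) length(intersect(y, x)))
  # Rows of DF2 that share at least two issues
  ids = which(matchvec > 1)
  str = DF2[, paste(Root_Cause[ids], collapse = ', ')]

  # No match yields an empty string; report it as NA instead
  ifelse(str == '', NA, str)

})]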

Result

> DF1
   ID                      Issues Location Customer RootCause
1:  1 Issue1,Issue4,Issue6,Issue7        x        a    R2, R3
2:  2        Issue2,Issue5,Issue6        y        b        R4
3:  3               Issue3,Issue4        z        c        NA
4:  4               Issue1,Issue5        w        d        R1
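
To make the matching step concrete, here is a minimal base R sketch of the overlap count for a single row (the ID == 1 row of the edited data). The names issue_lists, matched and root_cause are illustrative only and do not appear in the answer; the logic is meant to mirror the inner sapply() and which() calls above.

```r
# Issues for the ID == 1 row of the edited sample data, already split.
x <- c("Issue1", "Issue4", "Issue6", "Issue7")

# List_of_Issues from DF2, one character vector per root cause.
issue_lists <- list(
  R1 = c("Issue1", "Issue3", "Issue5"),
  R2 = c("Issue2", "Issue1", "Issue4"),
  R3 = c("Issue6", "Issue7"),
  R4 = c("Issue5", "Issue6")
)

# Count how many issues each root cause shares with x.
matchvec <- vapply(issue_lists, function(y) length(intersect(y, x)), integer(1))

# Keep the root causes that share at least two issues.
matched <- names(matchvec)[matchvec > 1]
root_cause <- if (length(matched) > 0) paste(matched, collapse = ", ") else NA_character_

print(matchvec)    # overlap counts per root cause
print(root_cause)  # "R2, R3"
```

R1 and R4 each share only one issue with this row, so they are dropped, while R2 and R3 each share two.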


Please provide a minimal, reproducible example along with the desired output. Use dput() for data, and specify all non-base packages with library() calls. Do not embed pictures of data or code; use indented code blocks instead.

Yes, I have done that now. Sorry, I am a beginner.

@Shiriam Good job. I tried to improve the formatting a bit. Now let's see if someone can help you.

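DF2 = DF2[, List_of_Issues := strsplit(List_of_Issues, split = ', ')]

I take it that here we are converting the List_of_Issues column to a vector?

We are splitting each string of combined issues into individual strings.

Thank you very much. The solution works well and is very useful.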
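
As a small illustration of what the split produces, the base R sketch below shows that strsplit() returns a list with one character vector per input string, which is why the column becomes a list column. The trimws() variant is an assumption added here for data with inconsistent spacing around the commas; it is not part of the answer.

```r
List_of_Issues <- c("Issue1, Issue3, Issue5", "Issue6, Issue7")

# One character vector per input string, collected in a list.
split_issues <- strsplit(List_of_Issues, split = ", ")
str(split_issues)

# Assumed variant: split on the bare comma and trim stray whitespace,
# in case the separator is not always exactly ", ".
split_loose <- lapply(strsplit("Issue1,Issue3 , Issue5", split = ","), trimws)
print(split_loose[[1]])
```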

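DF1 = data.table(ID = c(1, 2, 3, 4), Issues = c('Issue1, Issue4', 'Issue2, Issue5, Issue6, Issue7', 'Issue3, Issue4', 'Issue1, Issue5'), Location = c('x', 'y', 'z', 'w'), Customer = c('a', 'b', 'c', 'd'))
DF2 = data.table(Root_Cause = c('R1', 'R2', 'R3', 'R4'), List_of_Issues = c('Issue1, Issue3, Issue5', 'Issue2, Issue1, Issue4', 'Issue6, Issue7', 'Issue5, Issue6'))

Suppose a row of issues has multiple root causes, as in the data above. Is it possible to list them as a comma-separated string of root causes?

I have now edited the answer to account for multiple root causes.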
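
For the data in this comment, the same idea can be sketched in base R without data.table. The helper name match_root_causes and its min_shared parameter are hypothetical and introduced only for illustration. Unlike the answer's lapply() version, this sketch returns a plain character column rather than a list column.

```r
DF1 <- data.frame(
  ID = c(1, 2, 3, 4),
  Issues = c("Issue1, Issue4", "Issue2, Issue5, Issue6, Issue7",
             "Issue3, Issue4", "Issue1, Issue5"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Root_Cause = c("R1", "R2", "R3", "R4"),
  List_of_Issues = c("Issue1, Issue3, Issue5", "Issue2, Issue1, Issue4",
                     "Issue6, Issue7", "Issue5, Issue6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each element of `issues`, collect every root cause
# whose issue list shares at least `min_shared` issues with it.
match_root_causes <- function(issues, root_causes, issue_lists, min_shared = 2) {
  issues_split <- strsplit(issues, split = ", ")
  lists_split <- strsplit(issue_lists, split = ", ")
  vapply(issues_split, function(x) {
    shared <- vapply(lists_split, function(y) length(intersect(x, y)), integer(1))
    hits <- root_causes[shared >= min_shared]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
  }, character(1))
}

DF1$Root_Cause <- match_root_causes(DF1$Issues, DF2$Root_Cause, DF2$List_of_Issues)
print(DF1)
```

With this data, the second row shares two issues with both R3 and R4, so its Root_Cause comes out as "R3, R4", and the third row has no qualifying match and stays NA.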
