Matching words in a string against a variable in R

r string

I have a dataset that looks like this:

cp<-data.frame("name"=c("billy", "jean", "jean", "billy","billy", "dawn", "dawn"), 
"answer"=c("michael jackson is my favorite", "I like flowers", "flower is red","hey michael",
"do not touch me michael","i am a girl","girls have hair"))

It gives the output "debby dallas".

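Provided I have understood the question correctly, I think this is what you are looking for. As David mentioned, it does not handle plural forms of words; it only finds words that are exactly identical.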

billyAnswers<-cp$answer[cp$name=="billy"]
#output of billyAnswers
#[1] "michael jackson is my favorite" "hey michael"                   
#[3] "do not touch me michael"

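There you go: apply that to all of the names and you have your result.

For jean and dawn, no word is shared between their answers, so this method returns two character vectors of length 0: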

#jean's words
#[1] "I"       "like"    "flowers" "flower"  "is"      "red" 

#dawn's words
#[1] "i"     "am"    "a"     "girl"  "girls" "have"  "hair"
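
The step that applies this to every name is not shown above. Below is a minimal sketch of one way it could be done, assuming that "common words" means words that occur in more than one of a person's answers. The helper name `common_words` is hypothetical and not taken from the original answer.

```r
# Same data as in the question
cp <- data.frame(
  name = c("billy", "jean", "jean", "billy", "billy", "dawn", "dawn"),
  answer = c("michael jackson is my favorite", "I like flowers", "flower is red",
             "hey michael", "do not touch me michael", "i am a girl",
             "girls have hair"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: words that occur in more than one answer (exact matches only)
common_words <- function(answers) {
  # Unique words per answer, so a word repeated inside one answer counts once
  per_answer <- lapply(strsplit(as.character(answers), " ", fixed = TRUE), unique)
  counts <- table(unlist(per_answer))
  names(counts)[counts > 1]
}

# One character vector per name
result <- tapply(cp$answer, cp$name, common_words, simplify = FALSE)
result[["billy"]]
# [1] "michael"
```

With this data, `result[["jean"]]` and `result[["dawn"]]` are both `character(0)`, since "flower"/"flowers" and "girl"/"girls" are not exact matches.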

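Here is a (not very efficient) function I made that uses pmatch to do partial matching. Its problem is that it will also match "a" with "am", or "i" with "is", because they are very close too.
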
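freqFunc <- function(x){
  # Split all answers into lower-case words
  temp <- tolower(unlist(strsplit(as.character(x), " ")))
  temp2 <- length(temp)
  temp3 <- lapply(temp, function(x){
    # Repeating x lets pmatch use up the exact match first, then a unique partial match
    temp4 <- na.omit(temp[pmatch(rep(x, temp2), temp)])
    # Keep the matches only if the word matched more than one element
    temp4[length(temp4) > 1]
  })
  list(unique(unlist(temp3))) 
}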

library(data.table)
setDT(cp)[, lapply(.SD, freqFunc), by = name, .SDcols = "answer"]
#     name              answer
# 1: billy             michael
# 2:  jean i,is,flower,flowers
# 3:  dawn     a,am,girl,girls
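
To see why "a" and "am" end up paired, it can help to call `pmatch` directly on dawn's words. This is a small sketch; the vector `words` is typed in by hand here for illustration. With the default `duplicates.ok = FALSE`, each table entry can be matched only once, so the first copy of "a" takes the exact match and the second copy falls through to the only remaining partial match.

```r
# dawn's words, lower-cased (entered by hand for illustration)
words <- c("i", "am", "a", "girl", "girls", "have", "hair")

# Match "a" repeatedly against the same table
idx <- pmatch(rep("a", length(words)), words)
idx
# [1]  3  2 NA NA NA NA NA

# Drop the NAs to get the matched words
matched <- words[idx[!is.na(idx)]]
matched
# [1] "a"  "am"
```

The same mechanism pairs "girl" with "girls", which is the desired behaviour; short words such as "a" and "i" are simply prefixes of other words and get caught as well.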

Comments:

There is no "michael" in the first column...

@VincentGuillemot, yes, there is. @ErosRam, your first desired output is easy to achieve; the problem is that without an external NLP package it is hard to recognize plurals, which could make things messy. Please be more specific about what you want (with an example) and post the desired result.

@Davidernburg: That's too bad. I am interested in plurals too. Possibly even in matching misspelled words to their correctly spelled counterparts.

Actually, I would guess the OP wants "flower" and "flowers" to count as common. The same goes for "girl" and "girls".

Yes, I did not handle that (I made sure to say so clearly, so that there is no confusion), so if that is what the OP wants, hopefully someone will step in.

@DMT Thanks for your contribution. I appreciate it, and I upvoted. If I do not get any other answers, I will accept it. I do want plurals, but if that cannot be done without making things very complicated, I will settle for this.

@ErosRam Yes, I think I have an idea for a somewhat hacky way to do partial matching. It will not be particularly efficient, but I will post it if I have time.

This is awesome! Thank you so much! Is there a specific reason for using temp2 in rep(x, temp2)? Would rep(x, 2) work just as well? That seems to make the code faster.

temp2 is the length of temp; it is variable and not always equal to 2.

lapply takes each element of the vector temp one at a time and passes it to the function specified inside lapply. That function uses pmatch, which matches each element of temp against all elements of temp. It seems to me that we do not need to repeat the temp vector temp2 times. Twice is enough, right? I assume the argument to freqFunc is a string containing all the answers from the same name. What I mean is that when we are only interested in whether there is a match, not in the number of matches, setting temp2 to 2 seems sufficient.

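
The question about `rep(x, 2)` versus `rep(x, temp2)` can be checked on a small case. This is a sketch only: `match_words` is a hypothetical helper that isolates the `pmatch` step of `freqFunc` and exposes the repetition count as `n_rep`, and the input vector is made up for illustration. When a word occurs twice exactly and also has a plural, two repetitions are both used up by the exact matches, so the partial match is never reached.

```r
# Hypothetical helper: the pmatch step of freqFunc with the repetition count exposed
match_words <- function(x, words, n_rep) {
  as.character(na.omit(words[pmatch(rep(x, n_rep), words)]))
}

# Made-up input: "flower" twice, plus its plural
words <- c("flower", "flower", "flowers")

two_reps <- match_words("flower", words, 2)
two_reps
# [1] "flower" "flower"

all_reps <- match_words("flower", words, length(words))
all_reps
# [1] "flower"  "flower"  "flowers"
```

So under these assumptions the two choices can give different results: repeating `length(words)` times still picks up "flowers", while repeating twice does not. When no word is duplicated exactly, as in the question's data, both choices appear to find the same pairs.
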
# Exact-match variant: count each lower-cased word and keep those seen more than once
freqFunc2 <- function(x){
  temp <- table(tolower(unlist(strsplit(as.character(x), " "))))
  list(names(temp[temp > 1]))
}

library(data.table)
setDT(cp)[, lapply(.SD, freqFunc2), by = name, .SDcols = "answer"]
#     name  answer
# 1: billy michael
# 2:  jean        
# 3:  dawn