Regex performance: how to get the words from a word list that match a given sentence in R


regex r


I am trying to get only those words from a list that occur in a given sentence. The list can also contain bigrams. For example:

wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

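I have 1000 such sentences to compare against the word list, and the real word list is larger as well. I tried a brute-force approach using grep, but it took a long time (as expected). I would like to get the matching words in a way that performs better.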
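For reference, the brute-force approach described above might look roughly like the sketch below. This is an assumption about the shape of the original attempt, not the asker's actual code: it runs one fixed-string search per word, so the work grows with the number of sentences times the number of words. The names `hits` and `matched` are hypothetical.

```r
wordList <- c("really good", "better", "awesome", "true", "happy")
sentence <- "This is a really good program but it can be made better by making it more efficient"

# One search per word in the list; fixed = TRUE treats each word as a
# literal string instead of a regular expression
hits <- vapply(wordList, function(w) grepl(w, sentence, fixed = TRUE), logical(1))
matched <- wordList[hits]

# Caveat: this is substring matching, so a list word can also match
# inside a longer word
substring_hit <- grepl("true", "construed", fixed = TRUE)
```

Besides the cost, the substring behaviour is worth keeping in mind: "true" is found inside "construed", which is usually not what is wanted when matching whole words.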
library(dplyr)

wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

# get bigrams from the sentence (each word pasted to the word that follows it)
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) {paste(unigrams[i], unigrams[i+1])}))

# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)

# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE), grams, by=c('wordList'='grams'))

matches
#      wordList
# 1 really good
# 2      better

I was able to use @epi99's answer with only a slight modification:

wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

# get bigrams from the sentence
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) {paste(unigrams[i], unigrams[i+1])}))

# .. and combine into a single vector
grams <- c(unigrams, bigrams)

# use match function to get the matching words
matches <- match(grams, wordList)
matches <- na.omit(matches)
matchingwords <- wordList[matches]

Comments:

How about unlist(sapply(wordList, function(x) grep(x, sentence)))

Try the stringr package (or stringi, which stringr wraps). For example, wordList[stringr::str_detect(sentence, wordList)]

@nicola This approach also takes a lot of time.

Clearly @nicola's solution is much better than my sledgehammer!

@epi99 I was able to use your solution, using the match function instead of the inner join because it is faster. I have posted my solution below, but accepted your answer because it helped me.
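Since the question involves about 1000 sentences, the n-gram lookup from the answers can be wrapped in a function and applied to each sentence. The following is a minimal sketch under stated assumptions: `find_matches` is a hypothetical helper (not from the thread), tokenisation is a plain split on single spaces, and matching is case-sensitive. It builds the bigrams with a vectorised `paste` and uses `%in%` (which is built on `match`) rather than a join.

```r
# Hypothetical helper: returns the entries of word_list that occur in one
# sentence as whole unigrams or bigrams
find_matches <- function(sentence, word_list) {
  unigrams <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  n <- length(unigrams)
  # Pair every word with the word that follows it; no bigrams if fewer than 2 words
  bigrams <- if (n > 1) paste(unigrams[-n], unigrams[-1]) else character(0)
  grams <- c(unigrams, bigrams)
  # Keep the list entries that appear among the n-grams, in list order
  word_list[word_list %in% grams]
}

wordList <- c("really good", "better", "awesome", "true", "happy")
sentences <- c(
  "This is a really good program but it can be made better by making it more efficient",
  "I am happy",
  "nothing here"
)

# One character vector of matched words per sentence
result <- lapply(sentences, find_matches, word_list = wordList)
```

Unlike the grep-based variants, this matches whole tokens only, so punctuation attached to a word (for example "better,") would prevent a match unless the sentences are cleaned first.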