Regex performance: how to get the words from a word list that match a given sentence in R
I am trying to get only those words from a list that occur in a given sentence. The words can also include bigrams. For example:
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
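I have 1000 such sentences to compare the words against, and the word list is also larger. I tried a brute-force approach using the grep command, but it took a lot of time (as expected). I am looking for a better-performing way to get the matching words.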
require(dplyr)
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) {paste(unigrams[i], unigrams[i+1])} ))
# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)
# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE),
                      grams,
                      by = c('wordList' = 'grams'))
matches
wordList
1 really good
2 better
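When bigrams are built by index, the index range deserves care: in R, `:` binds tighter than `-`, so `1:n - 1` means `(1:n) - 1`, which starts at 0. The sketch below shows that, together with a vectorised way to pair each word with its successor that avoids index arithmetic altogether. `make_grams` is a hypothetical helper name, not part of the answer above.

```r
# Hypothetical helper (not from the answer above):
# returns the unigrams followed by the bigrams of one sentence.
make_grams <- function(sentence) {
  unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))
  # Pair each word with its successor; a one-word sentence yields no bigrams.
  bigrams <- paste(head(unigrams, -1), tail(unigrams, -1))
  c(unigrams, bigrams)
}

# Operator precedence: ":" is evaluated before "-".
1:3 - 1          # 0 1 2
seq_len(3 - 1)   # 1 2

make_grams("a really good program")
```

The resulting vector can be fed into the same join against the word list.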
I was able to use @epi99's answer, with only a slight modification:
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) {paste(unigrams[i], unigrams[i+1])} ))
# .. and combine into a single vector
grams <- c(unigrams, bigrams)
# use match function to get the matching words
matches <- match(grams, wordList )
matches <- na.omit(matches)
matchingwords <- wordList[matches]
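Since the question mentions about 1000 sentences, here is a minimal sketch of how the same n-gram lookup might be applied to a whole vector of sentences. It is an illustration rather than the poster's actual code: `match_sentence` and the second sentence are made up for the example, and `%in%` (which is built on `match`) is used so that each word is reported at most once, in word-list order.

```r
wordList <- c("really good", "better", "awesome", "true", "happy")
sentences <- c(
  "This is a really good program but it can be made better by making it more efficient",
  "I am happy with it"   # hypothetical second sentence, for illustration only
)

# Hypothetical helper: words from wordList that occur in one sentence
# as a whole unigram or bigram.
match_sentence <- function(sentence, wordList) {
  unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))
  bigrams <- paste(head(unigrams, -1), tail(unigrams, -1))
  grams <- c(unigrams, bigrams)
  wordList[wordList %in% grams]
}

# One character vector of matched words per sentence.
matched <- lapply(sentences, match_sentence, wordList = wordList)
matched
```

Splitting on a single space means punctuation and letter case are not normalised, so "better," or "Better" would not match "better" in this sketch.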
How about:
unlist(sapply(wordList, function(x) grep(x, sentence)))
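A variant of the same idea, sketched under the assumption that the words are plain strings rather than patterns: passing `fixed = TRUE` makes `grepl` do a literal search instead of a regular-expression match, which is often cheaper (no timings were measured here). The example also shows a caveat that applies to any grep-style approach: it matches substrings, not whole words.

```r
wordList <- c("really good", "better", "awesome", "true", "happy")
sentence <- "This is a really good program but it can be made better by making it more efficient"

# fixed = TRUE treats each word as a literal string, not a regular expression.
hits <- vapply(wordList, function(w) grepl(w, sentence, fixed = TRUE), logical(1))
wordList[hits]

# Caveat: this is substring matching, so "true" is found inside "untrue".
grepl("true", "That is untrue", fixed = TRUE)

# One illustrative way to require whole words: wrap the word in \b boundaries.
# This assumes the word contains no regex metacharacters.
grepl("\\btrue\\b", "That is untrue", perl = TRUE)
```

If whole-word matching matters, the n-gram lookup from the other answers avoids the issue because it compares complete tokens.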
Try the stringr package (or the stringi package, which stringr wraps). For example: wordList[stringr::str_detect(sentence, wordList)]
@nicola This approach also takes a lot of time.
Obviously, @nicola's solution is much better than my sledgehammer!
@epi99 I was able to use your solution, using the match function instead of the inner join because it is faster. I have posted my own solution as well, but accepted your answer because it helped me.