Regex performance - how to get the words from a word list that match a given sentence in R

I am trying to get only those words from a list that occur in a given sentence. The words may also include bigrams. For example,

wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
I have 1000 such sentences to run this comparison on, and the word list is larger as well. I tried a brute-force approach with grep, but it took a long time (as expected). I am looking for a more performant way to get the matching words.

require(dplyr)
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

# get  unigrams  from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

# get bigrams from the sentence
bigrams <- unlist(lapply(1:(length(unigrams)-1), function(i) {paste(unigrams[i], unigrams[i+1])} ))

# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)

# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE), 
                      grams,
                      by=c('wordList'='grams'))

matches
     wordList
1 really good
2      better
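Since the question mentions running this over 1000 sentences, the per-sentence steps above can be factored into a helper and mapped over a vector of sentences. A minimal base-R sketch of that idea (the `get_grams` helper name is my own, not from the answer, and it assumes tokens are separated by single spaces):

```r
# build the unigrams and bigrams of one sentence,
# assuming tokens are separated by single spaces
get_grams <- function(sentence) {
  unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))
  n <- length(unigrams)
  bigrams <- if (n > 1) paste(unigrams[-n], unigrams[-1]) else character(0)
  c(unigrams, bigrams)
}

wordList <- c("really good", "better", "awesome", "true", "happy")
sentences <- c("This is a really good program",
               "a true and happy ending")

# one character vector of matched words per sentence
matches_per_sentence <- lapply(sentences, function(s) intersect(get_grams(s), wordList))
```

`intersect` keeps each hit once, in the order it appears among the sentence's grams.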
I was able to use @epi99's answer with a slight modification:

wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")

# get  unigrams  from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

# get bigrams from the sentence
bigrams <- unlist(lapply(1:(length(unigrams)-1), function(i) {paste(unigrams[i], unigrams[i+1])} ))

# .. and combine into a single vector

grams <- c(unigrams, bigrams)

# use match function to get the matching words

matches <- match(grams, wordList )
matches <- na.omit(matches)
matchingwords <- wordList[matches]
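The same lookup can also be written a bit more directly with `%in%`, which avoids the intermediate index vector and keeps the results in `wordList` order. A self-contained sketch of that variant (my rephrasing, not code from the answer):

```r
wordList <- c("really good", "better", "awesome", "true", "happy")
sentence <- "This is a really good program but it can be made better by making it more efficient"

unigrams <- unlist(strsplit(sentence, " ", fixed = TRUE))
# adjacent-word pairs; drop the last/first token respectively to pair them up
bigrams <- paste(unigrams[-length(unigrams)], unigrams[-1])
grams <- c(unigrams, bigrams)

# keep the wordList entries that occur among the sentence's grams
matchingwords <- wordList[wordList %in% grams]
```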

How about

unlist(sapply(wordList, function(x) grep(x, sentence)))
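One caveat with grep here: the patterns are unanchored, so a list entry like "true" would also match inside a longer word such as "construed". Wrapping each pattern in word boundaries avoids that; a hedged sketch (assuming the list entries contain no regex metacharacters):

```r
wordList <- c("really good", "better", "awesome", "true", "happy")
sentence <- "This is a really good program but it can be made better by making it more efficient"

# anchor every pattern at word boundaries so "true" cannot match inside "construed"
patterns <- paste0("\\b", wordList, "\\b")
hits <- wordList[vapply(patterns, function(p) grepl(p, sentence), logical(1))]
```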

Try the stringr package (or stringi, on which stringr is built). E.g.

wordList[stringr::str_detect(sentence, wordList)]

@nicola this approach also takes a lot of time
Apparently @nicola's solution is much better than my sledgehammer! @epi99 I was able to use your solution with the match function in place of the inner join, since it is faster. I have posted my solution below but accepted your answer, since it helped me.