Regex: counting how many words from a list appear in a string

Tags: regex, r, string, grepl

I have a set of unique words (already stemmed) in a character vector, and I want to know how many of them appear in a string.

Here is what I have so far:

library(RTextTools)

string <- "Players Information donation link controller support years fame glory addition champion Steer leader gang ghosts life Power Pellets tables gobble ghost"
wordstofind <- c("player","fame","field","donat")

# Stem the words in the string; the column names of the term matrix are the stems
string.stem <- colnames(create_matrix(string, stemWords = TRUE, removeStopwords = FALSE))

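Well, I never work with large datasets, so timing has never been critical for me, but with the data you provided, this counts how many of the words exactly match a word in the string. It may be a good starting point: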
sum(wordstofind %in% unlist(strsplit(string, " ")))

> sum(wordstofind %in% unlist(strsplit(string, " ")))
[1] 1

Edit: comparing against the stems gives the correct 3 matches, thanks to @Anthony Bissel:
sum(wordstofind %in% unlist(string.stem))

> sum(wordstofind %in% unlist(string.stem))
[1] 3
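
For readers who do not have a stemmer at hand, a rough stand-in is to check whether each target word is a prefix of some lowercased token in the string. This is only a sketch under that assumption, not what the answer above does: a stem is not always a literal prefix of the original word, so the two approaches can disagree on other data. The names `tokens`, `has_prefix`, and `n_matched` are introduced here for illustration.

```r
string <- "Players Information donation link controller support years fame glory addition champion Steer leader gang ghosts life Power Pellets tables gobble ghost"
wordstofind <- c("player", "fame", "field", "donat")

# Split on whitespace and lowercase, so that "Players" can match "player"
tokens <- tolower(unlist(strsplit(string, "\\s+")))

# For each target word: is it a prefix of at least one token?
has_prefix <- vapply(wordstofind,
                     function(w) any(startsWith(tokens, w)),
                     logical(1))

# Number of target words found this way (3 for this example)
n_matched <- sum(has_prefix)
```

`has_prefix` is a named logical vector, so it also shows which words matched ("field" is the one that does not).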

There may well be a faster option, but this works:
length(wordstofind) - length(setdiff(wordstofind, string.stem)) # 3

But Andrew Taylor's answer appears to be faster:
> microbenchmark(sum(wordstofind %in% unlist(string.stem)), length(wordstofind) - length(setdiff(wordstofind, string.stem)))
Unit: microseconds
                                                           expr    min     lq     mean median     uq    max neval
                      sum(wordstofind %in% unlist(string.stem))  4.016  4.909  6.55562  5.355  5.801 37.485   100
length(wordstofind) - length(setdiff(wordstofind, string.stem)) 16.511 16.958 21.85303 17.404 18.296 81.218   100
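
The two expressions timed above agree only because `wordstofind` holds unique words, as stated in the question. The sketch below illustrates that, and shows how the counts can diverge once a word is repeated. The vector `stems` is a hand-written stand-in for part of `string.stem` (an assumption, not actual RTextTools output), and `intersect` is added here purely for comparison.

```r
wordstofind <- c("player", "fame", "field", "donat")

# Hand-written stand-in for a few entries of string.stem (assumption)
stems <- c("player", "fame", "donat", "link")

# With unique target words, all three formulations give the same count
n_in        <- sum(wordstofind %in% stems)
n_setdiff   <- length(wordstofind) - length(setdiff(wordstofind, stems))
n_intersect <- length(intersect(wordstofind, stems))

# With a repeated unmatched word they differ: setdiff() drops duplicates,
# so the repeated "field" is subtracted only once
dup <- c(wordstofind, "field")
n_in_dup      <- sum(dup %in% stems)
n_setdiff_dup <- length(dup) - length(setdiff(dup, stems))
```

Here `n_in`, `n_setdiff`, and `n_intersect` are all 3, while `n_in_dup` is 3 and `n_setdiff_dup` is 4, so `%in%` is probably the safer choice if duplicates can occur.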

Take a look at Hadley Wickham's work. You may be looking for the function str_count.
The result should actually be 3, so this doesn't work, but perhaps you just used string rather than the stemmed vector.
How do you get 3? In the example provided, the only exact, complete match is fame.
Sorry, I wasn't clear in my explanation that the comparison is against the stems of the string: sum(wordstofind %in% unlist(string.stem)). This works, and it looks like your solution is faster than mine. See my answer below.
Ah, I think I missed the part about getting the stems. Oops. I'm glad it still works. Thanks for noticing.
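
The `str_count` function mentioned in the comments comes from the stringr package and counts pattern occurrences in a string. A base R approximation of that idea is sketched below, assuming that matching each target at the start of a word, case-insensitively, is acceptable. The helper `count_hits` is hypothetical, and the target words are assumed to contain no regex metacharacters.

```r
string <- "Players Information donation link controller support years fame glory addition champion Steer leader gang ghosts life Power Pellets tables gobble ghost"
wordstofind <- c("player", "fame", "field", "donat")

# Count occurrences of `w` at the start of a word in `text`, ignoring case
count_hits <- function(w, text) {
  m <- gregexpr(paste0("\\b", w), text, ignore.case = TRUE, perl = TRUE)
  length(regmatches(text, m)[[1]])
}

# Occurrences per target word, then the number of targets seen at least once
hits <- vapply(wordstofind, count_hits, integer(1), text = string)
n_present <- sum(hits > 0)
```

For this example `hits` is 1, 1, 0, 1 and `n_present` is 3, matching the stem-based count. Unlike the `%in%` approach, this works on the raw string and does not require stemming it first.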