Counting words and word stems in a large data frame (RStudio)

Tags: r, dictionary, text, text-mining, stringr

I have a large data frame consisting of tweets, and a keyword dictionary loaded as a list, kw_Emo, containing words and word stems related to emotion. I need to find a way to count how many times any given word/stem from kw_Emo occurs in each tweet. In kw_Emo, stems are marked with an asterisk (*). For example, one entry is the stem ador*, meaning I need to account for the presence of adorable, adore, adoring, or any pattern of letters beginning with ador.

In a previous Stack Overflow discussion (see the earlier question on my profile), I received a lot of help with the following solution, but it only counts exact character matches (e.g., it counts only ador, not adorable):

Load the relevant package:

library(stringr)

Identify and remove the * from the word stems in kw_Emo:

for (x in 1:length(kw_Emo)) {
  if (grepl("[*]", kw_Emo[x]) == TRUE) {
    kw_Emo[x] <- gsub("[*]", "", kw_Emo[x])
  }
}

So first, let's get rid of some of those for loops:

But note what happens if you use ^ and $:

In your code, you then want to look at the length of the grep output, e.g. to append it to the data.frame as a count column:
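As a toy illustration of that step (on a hypothetical token vector, not the real tweets), the count for one keyword is just the length of the grep() result:

```r
# Hypothetical tokenised tweet, standing in for one element of `tweets`
tokens <- strsplit("i adore adorable shoes", "\\s+")[[1]]

length(grep("^ador", tokens))    # prefix match for the stem: 2
length(grep("^shoes$", tokens))  # exact match for a whole word: 1
```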


Comments on the answer:

- "Of course you can optimise this code further, or decide, depending on your problem, whether to use grep(paste0("^", kws), x) instead of grep(kws, x) in the first loop as well."
- "Could you explain how to integrate your answer into usable code? I have tried to use it without success, including the first part where you create the new vectors for words and word stems, which produces empty values. [I have added the code I used from your answer to the original post.]"
- "@JasonB If there is an error message, could you check where the NAs show up? Note that I used the exact variables from your post; my TestTweets has no levels, for example."
- "ind_stem is saved in my environment as integer(0) (empty), kw_stem and kw_word as character(0) (empty), and kws and kww are empty too. Maybe something is wrong with my grep? The rest of the code does nothing, since it depends on that first step."
- "@JasonB Are you using the original kw_Emo vector you posted? It looks to me as if you are trying to run grep("[*]", kw_Emo) on a vector from which the * has already been removed in step 2 of the code."
- "@JasonB Use grep(paste0("^", kws), x) instead of grep(kws, x) in the first loop."
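The anchoring suggested in the comments matters because a bare grep(kws, x) also hits tokens that merely contain the stem somewhere inside; a toy example:

```r
words <- c("adorable", "matador", "ador")

grep("ador", words)    # substring match: 1 2 3 (also hits "matador")
grep("^ador", words)   # anchored stem match: 1 3
grep("^ador$", words)  # exact word match: 3
```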
Sample data (TestTweets):

structure(list(Time = c("24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:04", "24/06/2016 10:55:04", 
"24/06/2016 10:55:04", "24/06/2016 10:55:03", "24/06/2016 10:55:03"
), clean_text = c("mayagoodfellow as always making sense of it all for us ive never felt less welcome in this country brexit  httpstcoiai5xa9ywv", 
"never underestimate power of stupid people in a democracy brexit", 
"a quick guide to brexit and beyond after britain votes to quit eu httpstcos1xkzrumvg httpstcocniutojkt0", 
"this selfinflicted wound will be his legacy cameron falls on sword after brexit euref httpstcoegph3qonbj httpstcohbyhxodeda", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"this is a very good summary no biasspinagenda of the legal ramifications of the leave result brexit httpstcolobtyo48ng", 
"you cant make this up cornwall votes out immediately pleads to keep eu cash this was never a rehearsal httpstco", 
"no matter the outcome brexit polls demonstrate how quickly half of any population can be convinced to vote against itself q", 
"i wouldnt mind so much but the result is based on a pack of lies and unaccountable promises democracy didnt win brexit pro", 
"so the uk is out cameron resigned scotland wants to leave great britain sinn fein plans to unify ireland and its o", 
"absolutely brilliant poll on brexit by yougov httpstcoepevg1moaw", 
"retweeted mikhail golub golub\r\n\r\nbrexit to be followed by grexit departugal italeave fruckoff czechout httpstcoavkpfesddz", 
"think the brexit campaign relies on the same sort of logic that drpepper does whats the worst that can happen thingsthatarewellbrexit", 
"am baffled by nigel farages claim that brexit is a victory for real people as if the 47 voting remain are fucking smu", 
"not one of the uks problems has been solved by brexit vote migration inequality the uks centurylong decline as", 
"scotland should never leave eu  calls for new independence vote grow httpstcorudiyvthia brexit", 
"the most articulate take on brexit is actually this ft reader comment today httpstco98b4dwsrtv", 
"65 million refugees half of them are children  maybe instead of fighting each other we should be working hand in hand ", 
"im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain", 
"life is too short to wear boring shoes  brexit")), .Names = c("Time", 
"clean_text"), row.names = c(NA, 20L), class = c("tbl_df", "tbl", 
"data.frame"))
# Step 1: split kw_Emo into stems (entries marked with *) and whole words
ind_stem <- grep("[*]", kw_Emo)
kw_stem  <- gsub("[*]", "", kw_Emo[ind_stem])
kw_word  <- kw_Emo[-ind_stem]
# Step 2: split each tweet into individual tokens
# (use $ rather than [, "clean_text"], which returns a one-column tibble
# for a tbl_df and makes strsplit() fail)
tweets <- strsplit(TestTweets$clean_text, "\\s+")
> grep("Abc", c("Abc", "Abcdef"))
[1] 1 2
> grep("^Abc$", c("Abc", "Abcdef"))
[1] 1
for (kws in kw_stem) {
    # count tokens that start with the stem (prefix match)
    count_i <- unlist(lapply(tweets, function(x) length(grep(paste0("^", kws), x))))
    TestTweets <- cbind(TestTweets, count_i)
    colnames(TestTweets)[ncol(TestTweets)] <- paste0(kws, "*")
}
for (kww in kw_word) {
    # count tokens that match the whole word exactly
    count_i <- unlist(lapply(tweets, function(x) length(grep(paste0("^", kww, "$"), x))))
    TestTweets <- cbind(TestTweets, count_i)
    colnames(TestTweets)[ncol(TestTweets)] <- kww
}
> TestTweets[19:20, c("clean_text", "boring")]
                                                                                                                    clean_text boring
19 im laughing at people who voted for brexit but are complaining about the exchange rate affecting their holiday\r\nremain      0
20                                                                           life is too short to wear boring shoes  brexit      1
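Since stringr is already loaded, the two counting loops can also be condensed into a single vectorised pass with str_count(), which avoids tokenising altogether. This is only a sketch: the miniature kw_Emo and tweet vectors below are hypothetical stand-ins for the real objects from the question.

```r
library(stringr)

# Hypothetical miniature inputs standing in for kw_Emo and TestTweets$clean_text
kw_Emo     <- c("ador*", "boring")
clean_text <- c("i adore adorable shoes",
                "life is too short to wear boring shoes")

# Split the dictionary into stems (marked with *) and whole words
is_stem <- str_detect(kw_Emo, fixed("*"))
kw_stem <- str_remove(kw_Emo[is_stem], fixed("*"))
kw_word <- kw_Emo[!is_stem]

# One regex per dictionary entry: a stem matches any token that starts
# with it, a whole word must match the entire token
patterns <- c(paste0("\\b", kw_stem, "[a-z]*\\b"),
              paste0("\\b", kw_word, "\\b"))
names(patterns) <- c(paste0(kw_stem, "*"), kw_word)

# str_count() is vectorised over the tweets, so no explicit loop over
# tweets is needed; the result is one count column per dictionary entry
counts <- sapply(patterns, function(p) str_count(clean_text, p))
counts
#      ador* boring
# [1,]     2      0
# [2,]     0      1
```

The count matrix can then be bound onto the data frame with cbind(TestTweets, counts), exactly as in the loop version above.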