R tm package: dictionary matching yields a higher frequency than the actual number of words in the text


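I have been using the code below to load text as a corpus and clean it with the tm package. As a next step, I load a dictionary and clean it as well. Then I match the words in the text against the dictionary to calculate a score. However, the matching produces more matches than there are words in the text (for example, the competence score is 1500, but the text contains only 1000 words).

I think this is related to the stemming of the text and the dictionary, because the number of matches is lower when no stemming is performed.

Do you have any idea why this is happening?

Thanks a lot.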

R code

Step 1: Store the data as a corpus

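    library(readxl)  # read_excel()
    library(here)    # here()
    library(tm)      # stemDocument()

    # Load the dictionary, then lower-case, stem, and de-duplicate its terms
    dic.competence <- read_excel(here("Raw Data", "6. Dictionaries", "Brand.xlsx"))
    dic.competence <- tolower(dic.competence$COMPETENCE)
    dic.competence <- stemDocument(dic.competence)
    dic.competence <- unique(dic.competence)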

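    # Match the terms of the document-term matrix against the dictionary
    corpus.terms = colnames(dtm)
    competence = match(corpus.terms, dic.competence, nomatch=0)

    # Score: matches divided by the number of words in each document
    competence.score = sum(competence) / rowSums(as.matrix(dtm))
    competence.score.df = data.frame(scores = competence.score)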
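One thing worth checking in the scoring lines above: `match()` returns the position of each term within the dictionary, not a 0/1 flag, so `sum(competence)` adds up dictionary indices rather than counting matches. That could explain a total that exceeds the word count, and why stemming (which lets more terms match) pushes it higher. This is a possible cause, not a confirmed diagnosis. The small sketch below uses invented toy vectors in place of the real corpus terms and dictionary.

```r
# Toy data, invented for illustration only (not the asker's corpus or dictionary)
corpus.terms   <- c("abl", "brand", "compet", "skill", "the")
dic.competence <- c("expert", "compet", "skill", "abl")

# match() gives the index of each corpus term inside the dictionary (0 = no match)
positions <- match(corpus.terms, dic.competence, nomatch = 0)

# Summing the indices mixes dictionary positions into the "count"
sum.of.positions <- sum(positions)

# Counting only whether a term matched gives the number of matched terms
n.matched <- sum(positions > 0)

print(positions)
print(sum.of.positions)
print(n.matched)
```

Here three terms match, yet the sum of positions is larger than three because it depends on where each term happens to sit in the dictionary.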

When you run that line, what does competence return?

I don't know how your dictionary is compiled, so I can't be sure what is happening there. I used my own random corpus as the main text and a separate corpus as the dictionary, and your code worked fine. The row names of competence.score.df are the names of the different txt files in my corpus, and the scores are all in the 0-1 range.

    # this is my 'dictionary' of terms:
    tdm <- TermDocumentMatrix(Corpus(DirSource("./corpus/corpus3")),
                              control = list(removeNumbers = TRUE,
                                             stopwords = TRUE,
                                             stemming = TRUE,
                                             removePunctuation = TRUE))

    # then I used your programming and it worked as I think you were expecting
    # notice what I used here for the dictionary
    (competence = match(colnames(dtm),
                        Terms(tdm)[1:10],  # I only used the first 10 in my test of your code
                        nomatch = 0))
    (competence.score = sum(competence)/rowSums(as.matrix(dtm)))
    (competence.score.df = data.frame(scores = competence.score))

Dear Kat, thank you very much for the proposed solution. My dictionary is set up as a plain csv with a single column of terms. I tried running the solution you suggested, but it still gives me a higher competence number, so there is still some double matching that I cannot pin down. I also had not read the dictionary in as a corpus to begin with, which is a good tip.

Can you provide a sample of how the dictionary content is structured? Even if it is not the same data, that might lead me or someone else to another idea for solving this.
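If the intended score is the share of each document's words that appear in the dictionary, one way to sketch it is to flag dictionary terms with `%in%` (a logical test, unlike the positions returned by `match()`) and sum their counts per document. This is an assumption about the goal, not the thread's accepted solution. The matrix `m` below is an invented stand-in for `as.matrix(dtm)`, and the dictionary is toy data.

```r
# Toy document-term counts standing in for as.matrix(dtm); all values are invented
m <- matrix(c(2, 0, 1, 3,
              0, 4, 1, 5),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1.txt", "doc2.txt"),
                            c("compet", "brand", "skill", "the")))

# Toy dictionary of already-stemmed terms (hypothetical)
dic.competence <- c("expert", "compet", "skill", "abl")

# TRUE for every column (term) that appears in the dictionary
in.dict <- colnames(m) %in% dic.competence

# Per-document count of dictionary words, divided by the document's total words
competence.count <- rowSums(m[, in.dict, drop = FALSE])
competence.score <- competence.count / rowSums(m)
competence.score.df <- data.frame(scores = competence.score)

print(competence.score.df)
```

Because the numerator is a subset of the counts in the denominator, each score computed this way stays between 0 and 1.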