How to find the most frequently used words per observation in R?
I am new to NLP, so please don't judge me too harshly. I have a very large data frame of customer feedback, and my goal is to analyze the feedback. I tokenized the words in the feedback and removed stop words (SMART). Now I need to get a table of the most frequently and less frequently used words. The code looks like this:
library(tokenizers)
library(stopwords)
words_as_tokens <-
  tokenize_words(dat$description,
                 stopwords = stopwords(language = "en", source = "smart"))
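(For reference, the answers below assume dat has a name column and a description column; the actual data is not shown in the question. A minimal toy dat, invented here just to make the snippets runnable, might look like this:)

# hypothetical toy data, only for reproducing the examples below;
# the real dat is a large data frame of customer feedback
dat <- data.frame(
  name = c("John", "Alex"),
  description = c("Absolutely amazing experience with amazon, great experience.",
                  "Nice agents, nice break times, agents answer at all times."),
  stringsAsFactors = FALSE
)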
Try this:
library(tokenizers)
library(stopwords)
library(tidyverse)
# count freq of words
words_as_tokens <- setNames(lapply(sapply(dat$description,
                                          tokenize_words,
                                          stopwords = stopwords(language = "en", source = "smart")),
                                   function(x) as.data.frame(sort(table(x), TRUE), stringsAsFactors = F)),
                            dat$name)
# tidyverse's job
df <- words_as_tokens %>%
  bind_rows(.id = "name") %>%
  rename(word = x)
# output
df
# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
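If you only need, say, the top five words per customer, a short dplyr sketch on the df built above (slice_max is one way to do it; n = 5 is an arbitrary choice):

# top 5 words per customer (ties are kept by default)
df %>%
  group_by(name) %>%
  slice_max(Freq, n = 5) %>%
  ungroup()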
You can also try quanteda, as shown below:
library(quanteda)
# define a corpus object to store your initial documents
mycorpus = corpus(dat$description)
# convert the corpus to a Document-Feature Matrix
mydfm = dfm(mycorpus,
            tolower = TRUE,
            remove = stopwords(),   # this removes English stopwords
            remove_punct = TRUE,    # this removes punctuation
            remove_numbers = TRUE,  # this removes digits
            remove_symbols = TRUE,  # this removes symbols
            remove_url = TRUE)      # this removes urls
# calculate word frequencies and return a data.frame
word_frequencies = textstat_frequency(mydfm)
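word_frequencies is sorted by frequency, so the most and least used words sit at either end of the data.frame. A hedged usage sketch (note: on quanteda 3.x, textstat_frequency has moved to the separate quanteda.textstats package, so you may need to load that as well):

library(quanteda.textstats)  # needed on quanteda >= 3.x for textstat_frequency
head(word_frequencies, 10)   # ten most frequent words
tail(word_frequencies, 10)   # ten least frequent words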
Sorry, what is "x"? What is function? @k1rgas It is the argument of the anonymous (lambda) function, a common idiom with the apply family. For example, if you want to know how many values are missing in each column of a data.frame, you can do: apply(df, 2, function(x) sum(is.na(x))). In this case, x is each column of the data.frame df.
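A minimal runnable illustration of that comment (df2 is a toy data frame invented here):

# toy data frame with some missing values
df2 <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
# count NAs per column; x is bound to each column in turn
apply(df2, 2, function(x) sum(is.na(x)))
#> a b
#> 1 2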