How to find the most frequently used words per observation in R?
I am new to NLP, so please don't judge me too harshly. I have a very large data frame of customer feedback, and my goal is to analyze the feedback. I tokenized the words in the feedback and removed stop words (SMART). Now I need to get a table of the most frequently and less frequently used words. The code looks like this:
library(tokenizers)
library(stopwords)
words_as_tokens <-
  tokenize_words(dat$description,
                 stopwords = stopwords(language = "en", source = "smart"))
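(For reference, the answers below assume dat has a name column and a description column; the actual data is not shown in the question. A minimal toy dat, invented here just to make the snippets runnable, might look like this:)

# hypothetical toy data, only for reproducing the examples below;
# the real dat is a large data frame of customer feedback
dat <- data.frame(
  name = c("John", "Alex"),
  description = c("Absolutely amazing experience with amazon, great experience.",
                  "Nice agents, nice break times, agents answer at all times."),
  stringsAsFactors = FALSE
)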
Try this:
library(tokenizers)
library(stopwords)
library(tidyverse)
# count freq of words
words_as_tokens <- setNames(lapply(sapply(dat$description,
                                          tokenize_words,
                                          stopwords = stopwords(language = "en", source = "smart")),
                                   function(x) as.data.frame(sort(table(x), TRUE), stringsAsFactors = F)),
                            dat$name)
# tidyverse's job
df <- words_as_tokens %>%
  bind_rows(.id = "name") %>%
  rename(word = x)
# output
df
# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
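If you only need, say, the top five words per customer, a short dplyr sketch on the df built above (slice_max is one way to do it; n = 5 is an arbitrary choice):

# top 5 words per customer (ties are kept by default)
df %>%
  group_by(name) %>%
  slice_max(Freq, n = 5) %>%
  ungroup()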
You can also try quanteda, as shown below:
library(quanteda)
# define a corpus object to store your initial documents
mycorpus = corpus(dat$description)
# convert the corpus to a Document-Feature Matrix
mydfm = dfm(mycorpus,
            tolower = TRUE,
            remove = stopwords(),   # this removes English stopwords
            remove_punct = TRUE,    # this removes punctuation
            remove_numbers = TRUE,  # this removes digits
            remove_symbols = TRUE,  # this removes symbols
            remove_url = TRUE)      # this removes urls
# calculate word frequencies and return a data.frame
word_frequencies = textstat_frequency(mydfm)
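word_frequencies is sorted by frequency, so the most and least used words sit at either end of the data.frame. A hedged usage sketch (note: on quanteda 3.x, textstat_frequency has moved to the separate quanteda.textstats package, so you may need to load that as well):

library(quanteda.textstats)  # needed on quanteda >= 3.x for textstat_frequency
head(word_frequencies, 10)   # ten most frequent words
tail(word_frequencies, 10)   # ten least frequent words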
Sorry, what is "x"? What is function? @k1rgas It is the argument of the anonymous (lambda) function, a common idiom with the apply family. For example, if you want to know how many values are missing in each column of a data.frame, you can do: apply(df, 2, function(x) sum(is.na(x))). In this case, x is each column of the data.frame df.
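A minimal runnable illustration of that comment (df2 is a toy data frame invented here):

# toy data frame with some missing values
df2 <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))
# count NAs per column; x is bound to each column in turn
apply(df2, 2, function(x) sum(is.na(x)))
#> a b
#> 1 2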