How do I count specific words in a corpus in R?


I need to count the frequency of specific words. Many words. I know how to do this by putting all the words into one group (see below), but I would like to get a count for each individual word.

This is what I have so far:

library(quanteda)
# function to count pattern matches within each element of x
strcount <- function(x, pattern, split) {
  unlist(lapply(strsplit(x, split),
                function(z) na.omit(length(grep(pattern, z)))))
}
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df <- data.frame(txt)
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
corp <- corpus(df, text_field = "txt")
# count terms and save the output to "overview"
overview <- dfm(corp, dictionary = mydict)
overview <- convert(overview, to = "data.frame")
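One way to get a separate count per term while staying in quanteda is to give every word its own dictionary key, so each key becomes its own column after `convert()`. A sketch, assuming quanteda v2+, where `tokens_lookup()` is the recommended replacement for `dfm(x, dictionary = ...)`:

```r
library(quanteda)

txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."

# One key per term, so each word keeps its own feature/column
mydict <- dictionary(list(clouds = "clouds", storms = "storms"))

overview <- tokens(txt) %>%
  tokens_lookup(dictionary = mydict) %>%
  dfm() %>%
  convert(to = "data.frame")
overview
```

The single-key version in the question collapses all matches into one `all_terms` total; splitting the keys is what keeps the counts apart.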

You can combine the `unnest_tokens` function from tidytext with `pivot_wider` from tidyr to get the count of each word in its own column:

library(dplyr)
library(tidytext)
library(tidyr)

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."

mydict <- c("clouds","storms")

df <- data.frame(text = txt) %>% 
  unnest_tokens(word, text) %>%
  count(word) %>% 
  pivot_wider(names_from = word, values_from = n)

df %>% select(all_of(mydict))

# A tibble: 1 x 2
  clouds storms
   <int>  <int>
1      1      1


You want to use the dictionary values as patterns in `tokens_select()`, rather than in a lookup function, which is what `dfm(x, dictionary = ...)` does. Here's how:

library("quanteda")
## Package version: 2.1.2

tokens(txt) %>%
  tokens_select(c("clouds", "storms")) %>%
  dfm()
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    clouds storms
##   text1      1      1
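If you just want a quick package-free cross-check of those counts, a minimal base-R sketch that lower-cases the text, strips punctuation, splits on whitespace, and tabulates only the target words:

```r
txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."
terms <- c("clouds", "storms")

# normalize: lower-case, drop punctuation, split into words
words <- strsplit(tolower(gsub("[[:punct:]]", "", txt)), "\\s+")[[1]]

# exact-match count for each requested term (named integer vector)
counts <- sapply(terms, function(w) sum(words == w))
counts
```

This is cruder than the tokenizers above (no handling of hyphens, contractions, or multi-word patterns), but it is a useful sanity check on small texts.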