How do I count specific words in a corpus in R?


I need to count the frequency of specific words. Many words. I know how to do this by putting all the words into one group (see below), but I would like to get a count for each individual word.

This is what I have so far:

library(quanteda)
# function to count pattern matches within each element of x
strcount <- function(x, pattern, split) {
  unlist(lapply(strsplit(x, split),
                function(z) na.omit(length(grep(pattern, z)))))
}
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df <- data.frame(txt)
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
corp <- corpus(df, text_field = "txt")
# count terms and save the output to "overview"
overview <- dfm(corp, dictionary = mydict)
overview <- convert(overview, to = "data.frame")
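One way to get a separate count per term while staying in quanteda is to give every word its own dictionary key, so each key becomes its own column after `convert()`. A sketch, assuming quanteda v2+, where `tokens_lookup()` is the recommended replacement for `dfm(x, dictionary = ...)`:

```r
library(quanteda)

txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."

# One key per term, so each word keeps its own feature/column
mydict <- dictionary(list(clouds = "clouds", storms = "storms"))

overview <- tokens(txt) %>%
  tokens_lookup(dictionary = mydict) %>%
  dfm() %>%
  convert(to = "data.frame")
overview
```

The single-key version in the question collapses all matches into one `all_terms` total; splitting the keys is what keeps the counts apart.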

You can combine the `unnest_tokens` function from tidytext with `pivot_wider` from tidyr to get the count of each word in its own column:

library(dplyr)
library(tidytext)
library(tidyr)

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."

mydict <- c("clouds","storms")

df <- data.frame(text = txt) %>% 
  unnest_tokens(word, text) %>%
  count(word) %>% 
  pivot_wider(names_from = word, values_from = n)

df %>% select(all_of(mydict))

# A tibble: 1 x 2
  clouds storms
   <int>  <int>
1      1      1


You want to use the dictionary values as patterns in `tokens_select()`, rather than in a lookup function, which is what `dfm(x, dictionary = ...)` does. Here's how:

library("quanteda")
## Package version: 2.1.2

tokens(txt) %>%
  tokens_select(c("clouds", "storms")) %>%
  dfm()
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    clouds storms
##   text1      1      1
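If you just want a quick package-free cross-check of those counts, a minimal base-R sketch that lower-cases the text, strips punctuation, splits on whitespace, and tabulates only the target words:

```r
txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."
terms <- c("clouds", "storms")

# normalize: lower-case, drop punctuation, split into words
words <- strsplit(tolower(gsub("[[:punct:]]", "", txt)), "\\s+")[[1]]

# exact-match count for each requested term (named integer vector)
counts <- sapply(terms, function(w) sum(words == w))
counts
```

This is cruder than the tokenizers above (no handling of hyphens, contractions, or multi-word patterns), but it is a useful sanity check on small texts.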