R: Understanding how dfm_group works when no groups are added (R, quanteda)

Based on this question:

If I have this function:

plot_topterms = function(data,text_field,n,...){
  corp=corpus(data,text_field = text_field) %>% 
    dfm(remove_numbers=T,remove_punct=T,remove=c(stopwords('english')),ngrams=1:2) %>%
    dfm_weight(scheme ='prop') %>% 
    dfm_group(groups=...) %>% 
    dfm_replace(pattern=as.character(lemma$first),replacement = as.character(lemma$X1)) %>% 
    dfm_remove(pattern = c(paste0("^", stopwords("english"), "_"), paste0("_", stopwords("english"), "$")), valuetype = "regex") %>% 
    dfm_remove(toRemove)
  freq_weight <- textstat_frequency(corp, n = n)

  ggplot(data = freq_weight, aes(x = nrow(freq_weight):1, y = frequency)) +
    geom_bar(stat='identity')+
    facet_wrap(~ group, scales = "free") +
    coord_flip() +
    scale_x_continuous(breaks = nrow(freq_weight):1,
                       labels = freq_weight$feature) +
    #scale_y_continuous(labels = scales::percent)+
    theme(text = element_text(size=20))+
    labs(x = NULL, y = "Relative frequency")
}

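You are correct in your interpretation of the "all" group. If groups is not specified in textstat_frequency(), the group defaults to "all". Your function never passes a groups argument when it calls textstat_frequency(), so the group will always be "all", even though the dfm has already been grouped by the dfm_group() call inside plot_topterms().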
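A value of 60 for a feature in this plot means that the sum of that feature's relative term frequencies (computed within each document) is 60. A simple example shows how this works: if A has a relative frequency of 0.20 in text1 and 0.67 in text2, textstat_frequency() sums the two to 0.87. Your 60 is analogous to this 0.87.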
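This is different from the document frequency, which is the number of documents in which a feature occurs at least once. If you want a feature's document frequency (which was your interpretation), plot docfreq from the textstat_frequency() return rather than frequency.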
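The difference between the two quantities can be checked by hand. The following is a minimal base R sketch, not quanteda output: the two toy documents ("a b c d e" and "a a b") are made up so that they reproduce the 0.20 and 0.67 figures above.

```r
# Toy term counts (made-up documents): text1 = "a b c d e", text2 = "a a b"
counts <- rbind(
  text1 = c(a = 1, b = 1, c = 1, d = 1, e = 1),
  text2 = c(a = 2, b = 1, c = 0, d = 0, e = 0)
)

# Relative term frequency within each document (each row sums to 1)
rel <- prop.table(counts, margin = 1)

# Sum of relative frequencies across documents, per feature
rel_sum <- colSums(rel)

# Document frequency: number of documents containing the feature at least once
doc_freq <- colSums(counts > 0)

round(rel_sum["a"], 2)  # 0.87 (0.20 + 0.67)
doc_freq["a"]           # 2
```

With many documents the first quantity can grow far above 1, which is how a value such as 60 arises.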

However, I would point out that plot_topterms() is not a well-designed function.

It relies on several variables that are not local to the function, namely toRemove and lemma.

It will not pass ... correctly in the dfm_group() call. You should specify a groups argument explicitly in the function signature.
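One way to address both points is sketched below. This is only an illustration under assumptions: build_grouped_dfm(), group_field and to_remove are hypothetical names, the tokens() based preprocessing assumes a recent quanteda version, and the n-gram and weighting steps of the original are left out so that the argument handling stays in focus.

```r
library(quanteda)

# Hypothetical helper: every input is an explicit argument, so nothing is
# read from the global environment and the grouping variable is named.
build_grouped_dfm <- function(data, text_field, group_field,
                              lemma = NULL, to_remove = character()) {
  corp <- corpus(data, text_field = text_field)
  toks <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE)
  toks <- tokens_remove(toks, stopwords("english"))
  dfmat <- dfm(toks)

  # Optional lemma table with columns `first` and `X1`, as in the question
  if (!is.null(lemma)) {
    dfmat <- dfm_replace(dfmat,
                         pattern = as.character(lemma$first),
                         replacement = as.character(lemma$X1))
  }
  if (length(to_remove) > 0) {
    dfmat <- dfm_remove(dfmat, pattern = to_remove)
  }

  # Look up the grouping variable by name, then group on its values
  grp <- docvars(dfmat, group_field)
  dfm_group(dfmat, groups = grp)
}
```

A call would then look like build_grouped_dfm(mydata, "text", "party"), where mydata, "text" and "party" stand in for your own data frame and column names.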

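If we were designing a new function for the package, we would create textplot_frequency(), which plots the return from textstat_frequency() and essentially just wraps the ggplot() call made after the user has built the textstat_frequency object. It could make smarter use of the group variable built into every textstat_frequency object, so that objects whose only group is "all" are plotted as a single facet.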