text2vec: iterating over the vocabulary after using the create_vocabulary function


Using the text2vec package, I created a vocabulary:

vocab = create_vocabulary(it_0, ngram = c(2L, 2L)) 
vocab looks like this:

> vocab
Number of docs: 120 
0 stopwords:  ... 
ngram_min = 2; ngram_max = 2 
Vocabulary: 
                    terms terms_counts doc_counts
    1:    knight_severely            1          1
    2:       movie_expect            1          1
    3: recommend_watching            1          1
    4:        nuke_entire            1          1
    5:      sense_keeping            1          1
   ---                                           
14467:         stand_idly            1          1
14468:    officer_loyalty            1          1
14469:    willingness_die            1          1
14470:         fight_bane            3          3
14471:     bane_beginning            1          1
How can I check the range of the terms_counts column? I need this because it will help me with pruning, which is my next step:

pruned_vocab = prune_vocabulary(vocab, term_count_min = <BLANK>)
The following code is reproducible:

library(text2vec)

text <- c(" huge fan superhero movies expectations batman begins viewing christopher 
          nolan production pleasantly shocked huge expectations dark knight christopher 
          nolan blew expectations dust happen film dark knight rises simply big expectations 
          blown production true cinematic experience behold movie exceeded expectations terms 
          action entertainment",                                                       
          "christopher nolan outdone morning tired awake set film films genuine emotional 
          eartbeat felt flaw nolan films vision emotion hollow bought felt hero villain 
          alike christian bale typically brilliant batman felt bruce wayne heavily embraced
          final installment bale added emotional depth character plot point astray dark knight")

it_0 = itoken( text,
               tokenizer = word_tokenizer,
               progressbar = T)

vocab = create_vocabulary(it_0, ngram = c(2L, 2L)) 
vocab

Try

range(vocab$vocab$terms_counts)
vocab is a list of some meta-information (number of documents, ngram size, etc.) and the main data.frame/data.table, which contains the word counts and the per-word document counts.

As mentioned above, vocab$vocab is what you need (the data.table with the counts).
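For example, here is a minimal sketch of how the range could feed into the pruning step from the question; the cutoff of 2 is purely illustrative and not part of the original question:

range(vocab$vocab$terms_counts)     # smallest and largest bigram count
summary(vocab$vocab$terms_counts)   # quartiles give a feel for the distribution

# keep only bigrams seen at least twice (illustrative threshold)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 2)
pruned_vocab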

You can inspect the internal structure by calling str(vocab):

List of 5
 $ vocab         :Classes ‘data.table’ and 'data.frame':    82 obs. of  3 variables:
  ..$ terms       : chr [1:82] "plot_point" "depth_character" "emotional_depth" "bale_added" ...
  ..$ terms_counts: int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ doc_counts  : int [1:82] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ ngram         : Named int [1:2] 2 2
  ..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
 $ document_count: int 2
 $ stopwords     : chr(0) 
 $ sep_ngram     : chr "_"
 - attr(*, "class")= chr "text2vec_vocabulary"
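Since vocab$vocab is an ordinary data.table, you can also look at the count distribution directly before choosing a threshold; the snippet below is a sketch that uses only base R helpers (head, order, sum, quantile), nothing text2vec-specific:

v = vocab$vocab

head(v[order(-terms_counts)], 5)        # five most frequent bigrams
sum(v$terms_counts == 1)                # how many bigrams occur only once
quantile(v$terms_counts, probs = 0.9)   # a possible data-driven cutoff for term_count_min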