Output text with both unigrams and bigrams in R

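I am trying to figure out how to identify the unigrams and bigrams in a text in R, and then keep both in the final output based on a frequency threshold. I have done this in Python with gensim's Phrases model, but have not worked out how to do it in R.

For example: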

strings <- data.frame(text = c('This is a great movie from yesterday', 'I went to the movies', 'Great movie time at the theater', 'I went to the theater yesterday'))
# Pseudocode below: tokenize_uni_bi() is a made-up function, not a real one
bigs <- tokenize_uni_bi(strings, n = 1:2, threshold = 2)
print(bigs)
# Desired output:
# [['this', 'great_movie', 'yesterday'], ['went', 'movies'], ['great_movie', 'theater'], ['went', 'theater', 'yesterday']]

You can use the quanteda framework for this:

library(quanteda)
# tokenize, tolower, remove stopwords and create ngrams
my_toks <- tokens(strings$text) 
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)

# turn into document feature matrix and filter on minimum frequency of 2 and more
my_dfm <- dfm(bigs)
dfm_trim(my_dfm, min_termfreq = 2)

Document-feature matrix of: 4 documents, 6 features (50.0% sparse).
       features
docs    great movie yesterday great_movie went theater
  text1     1     1         1           1    0       0
  text2     0     0         0           0    1       0
  text3     1     1         0           1    0       1
  text4     0     0         1           0    1       1

# use convert function to turn this into a data.frame
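The conversion step mentioned in that last comment could look roughly like this. It is a minimal sketch that rebuilds the objects from scratch; the name `my_df` is introduced here for illustration, and `strings` is assumed to hold the four example sentences in a single `text` column.

```r
library(quanteda)

# Example data: one text column holding the four sentences
strings <- data.frame(text = c("This is a great movie from yesterday",
                               "I went to the movies",
                               "Great movie time at the theater",
                               "I went to the theater yesterday"))

# Same pipeline as above: tokenize, lowercase, drop stopwords, build 1- and 2-grams
my_toks <- tokens(strings$text)
my_toks <- tokens_tolower(my_toks)
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)

# Keep features that occur at least twice, then convert to a data.frame
my_dfm <- dfm_trim(dfm(bigs), min_termfreq = 2)
my_df <- convert(my_dfm, to = "data.frame")
print(my_df)
```

The resulting data frame has one row per document, a document identifier column, and one count column per retained unigram or bigram.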

Both quanteda and tidytext have plenty of online help. See the vignettes of both packages on CRAN.
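If you want per-document token lists (closer to the list shape in the question) rather than a count matrix, one possible approach is to reuse the feature names that survive the trim as a filter on the tokens. This is only a sketch: the names `keep`, `kept` and `kept_list` are made up for illustration, and it is not equivalent to gensim's Phrases, because the unigrams `great` and `movie` remain next to `great_movie` instead of being merged into it.

```r
library(quanteda)

strings <- data.frame(text = c("This is a great movie from yesterday",
                               "I went to the movies",
                               "Great movie time at the theater",
                               "I went to the theater yesterday"))

# Tokenize, lowercase, drop stopwords, build unigrams and bigrams
my_toks <- tokens_tolower(tokens(strings$text))
my_toks <- tokens_remove(my_toks, stopwords("english"))
bigs <- tokens_ngrams(my_toks, n = 1:2)

# Feature names that occur at least twice across the corpus
keep <- featnames(dfm_trim(dfm(bigs), min_termfreq = 2))

# Keep only those unigrams/bigrams in each document, matching them literally
kept <- tokens_keep(bigs, pattern = keep, valuetype = "fixed")

# Plain named list with one character vector per document
kept_list <- as.list(kept)
print(kept_list)
```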

Thanks @phiver! The second answer is exactly what I was looking for. I should have looked more carefully through the tidytext documentation.

A footnote to a very good answer: quanteda supports/re-exports %>%, so dplyr-style chains are possible.
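For what it's worth, the chained version of the answer might look like the sketch below. It uses R's native pipe `|>` (R 4.1 or later) rather than `%>%`, so that it does not depend on whether the installed quanteda version re-exports the magrittr pipe; swapping in `%>%` should behave the same for this chain.

```r
library(quanteda)

strings <- data.frame(text = c("This is a great movie from yesterday",
                               "I went to the movies",
                               "Great movie time at the theater",
                               "I went to the theater yesterday"))

# Same steps as the answer, written as one chain
my_dfm <- strings$text |>
  tokens() |>
  tokens_tolower() |>
  tokens_remove(stopwords("english")) |>
  tokens_ngrams(n = 1:2) |>
  dfm() |>
  dfm_trim(min_termfreq = 2)

print(my_dfm)
```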