
How to get a list of all stemmed words and their original forms after running stemDocument in R

Tags: r, text-mining, tm, corpus, stemming

I'm trying to get a list of all the stemmed words together with their original forms.

Here's an example:

library(tm)
text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected", "it was very helpful","It was a wonderful experience")
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, stemDocument)
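After stemming, the corpus holds only the stemmed tokens; the original words are gone, which is why a separate lookup is needed. A quick way to see this (a sketch, just inspecting the corpus built above):

```r
library(tm)

text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected",
          "it was very helpful", "It was a wonderful experience")
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, stemDocument)

# Each document now contains only stemmed tokens such as "arriv" and "expect"
sapply(corpus, as.character)
```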

This might help you. The SnowballC package has a function called wordStem(). Using it, you can do the following. Since I use unnest_tokens() from the tidytext package, I first create a data frame; that function splits the text into words and produces a long-format dataset. It seems you want to remove stop words, so I did that with filter(). The last step is the crucial one for you: I use wordStem() from the SnowballC package to stem the words that remain. The result may not be exactly what you want, but I hope it helps.

library(dplyr)
library(tidytext)
library(SnowballC)

mydf <- data_frame(id = 1:length(text),
                   text = text)

data(stop_words)

mydf %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% stop_words$word) %>%
  mutate(stem = wordStem(word))

#      id       word    stem
#   <int>      <chr>   <chr>
# 1     1  impressed impress
# 2     1   shipping    ship
# 3     1       time    time
# 4     1    arrived   arriv
# 5     1       days     dai
# 6     1    earlier earlier
# 7     1   expected  expect
# 8     2    helpful    help
# 9     3  wonderful  wonder
#10     3 experience  experi
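Building on that output, the long data frame can be collapsed into a lookup of each stem and the original words that produced it — which is essentially the list the question asks for. A sketch using dplyr's group_by()/summarise() on the same columns (word, stem) created above:

```r
library(dplyr)
library(tidytext)
library(SnowballC)

text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected",
          "it was very helpful", "It was a wonderful experience")

data(stop_words)

stemmed <- data.frame(id = seq_along(text), text = text) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% stop_words$word) %>%
  mutate(stem = wordStem(word))

# One row per stem, listing every original word that maps to it
stemmed %>%
  group_by(stem) %>%
  summarise(words = paste(unique(word), collapse = ", "))
```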

This is a bit more efficient than @jazzurro's answer:

library("corpus")
text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected", "it was very helpful","It was a wonderful experience")
word <- text_types(text, collapse = TRUE, drop = stopwords_en, drop_punct = TRUE)
stem <- SnowballC::wordStem(word, "english")
data.frame(word, stem)

(If it matters to you, the text_types function also accepts tm corpus objects.)
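A minimal sketch of that, assuming the tm corpus built the same way as in the question (the variable name tm_corpus is mine):

```r
library("corpus")
library("tm")

text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected",
          "it was very helpful", "It was a wonderful experience")
tm_corpus <- Corpus(VectorSource(text))

# text_types() can be called on the tm corpus directly, per the note above
word <- text_types(tm_corpus, collapse = TRUE, drop = stopwords_en, drop_punct = TRUE)
stem <- SnowballC::wordStem(word, "english")
data.frame(word, stem)
```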

This should probably be wordStem(word, "english"), unless you specifically want the original Porter stemmer.
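To illustrate the difference: wordStem() defaults to the original "porter" algorithm, which is why the first answer stems "days" to "dai", while "english" (Porter2) gives "day":

```r
library(SnowballC)

# Default ("porter") vs. the updated "english" (Porter2) stemmer
wordStem("days")             # "dai"
wordStem("days", "english")  # "day"
```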
library("corpus")
text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected", "it was very helpful","It was a wonderful experience")
word <- text_types(text, collapse = TRUE, drop = stopwords_en, drop_punct = TRUE)
stem <- SnowballC::wordStem(word, "english")
data.frame(word, stem)
         word    stem
1     arrived   arriv
2        days     day
3     earlier earlier
4    expected  expect
5  experience  experi
6     helpful    help
7   impressed impress
8    shipping    ship
9        time    time 
10  wonderful  wonder