How to get a list of all stemmed words together with their original form after running stemDocument in R
I am trying to get a list of all stemmed words together with their original form. Here is an example:
library(tm)
text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected", "it was very helpful","It was a wonderful experience")
corpus<-Corpus(VectorSource(text))
corpus<-tm_map(corpus,stemDocument)
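This may help. The SnowballC package has a function called wordStem(), which you can use as follows. Because I use unnest_tokens() from the tidytext package, I first create a data frame. That function splits the text into words and produces a long-format data set. It looks like you want to remove stop words, so I do that with filter(). The last step is the one that matters for you: I use wordStem() from SnowballC to stem the words that remain in the data. The result may not be exactly what you want, but I hope it helps.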
library(dplyr)
library(tidytext)
library(SnowballC)

# One row per document; tibble() replaces the deprecated data_frame()
mydf <- tibble(id = seq_along(text),
               text = text)

data(stop_words)

mydf %>%
  unnest_tokens(input = text, output = word) %>%   # one lowercase word per row
  filter(!word %in% stop_words$word) %>%           # drop stop words
  mutate(stem = wordStem(word))                    # Porter stemmer by default

#       id       word    stem
#    <int>      <chr>   <chr>
#  1     1  impressed impress
#  2     1   shipping    ship
#  3     1       time    time
#  4     1    arrived   arriv
#  5     1       days     dai
#  6     1    earlier earlier
#  7     1   expected  expect
#  8     2    helpful    help
#  9     3  wonderful  wonder
# 10     3 experience  experi
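Since the question asks for each stem together with its original forms, the word/stem pairs can also be grouped by stem. The following is only a minimal sketch in base R plus SnowballC; the `words` vector is a hypothetical input chosen so that several forms share a stem, and the names `lookup` and `by_stem` are illustrative, not part of any package.

```r
library(SnowballC)

# Hypothetical input: several word forms that collapse to the same stem
words <- c("arrived", "arrives", "arriving", "shipping", "shipped", "time")
stems <- wordStem(words, language = "english")

# One row per original word, with its stem alongside
lookup <- data.frame(word = words, stem = stems, stringsAsFactors = FALSE)

# Named list: each stem maps to the original forms that produced it
by_stem <- split(lookup$word, lookup$stem)
by_stem
```

With the tidytext pipeline above, the same grouping could be applied to the `word` and `stem` columns of the resulting data frame.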
This is a bit more efficient than @jazzurro's answer:
library("corpus")
text <- c("Very Impressed with the shipping time, it arrived a few days earlier than expected", "it was very helpful","It was a wonderful experience")
word <- text_types(text, collapse = TRUE, drop = stopwords_en, drop_punct = TRUE)
stem <- SnowballC::wordStem(word, "english")
data.frame(word, stem)
(If it matters to you, the text_types function also accepts tm Corpus objects.)
It should probably be wordStem(word, "english"), unless you need the Porter stemmer.
Output:
         word    stem
1     arrived   arriv
2        days     day
3     earlier earlier
4    expected  expect
5  experience  experi
6     helpful    help
7   impressed impress
8    shipping    ship
9        time    time
10  wonderful  wonder
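The two outputs in this thread differ for "days" ("dai" versus "day") because wordStem() uses the original Porter stemmer unless a language is given. A small sketch that puts both stemmers side by side; the `words` vector is just a hypothetical sample taken from the example text:

```r
library(SnowballC)

# Sample words from the example sentences
words <- c("days", "arrived", "experience")

# Compare the default Porter stemmer with the "english" Snowball stemmer
comparison <- data.frame(
  word    = words,
  porter  = wordStem(words, language = "porter"),
  english = wordStem(words, language = "english"),
  stringsAsFactors = FALSE
)
comparison
```

Rows where the `porter` and `english` columns disagree are the ones affected by the choice of stemmer.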