stemDocument in the R tm package does not handle past-tense words
I have a file "check_text.txt" that contains several forms of make. I would like to stem it so that every form is reduced to its base form. I tried stemDocument from the tm package, as shown below, but the past-tense forms were left unstemmed. Is there a way to stem past-tense words? Is it necessary to do so in real-world natural language processing? Thanks.
library(tm)

filename <- 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)

# Build a corpus and stem every document in it
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument, language = "english")

# Show the stemmed text of each document
unlist(lapply(text_corpus, as.character))
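The behaviour can be reproduced without the file. The sketch below calls stemDocument directly on a character vector; the word list is an assumption made for illustration, not the actual content of check_text.txt. With the English Snowball stemmer, the regular form says should be reduced to say, while the irregular forms said and made should come back unchanged, because the stemmer only strips suffixes by rule.

```r
library(tm)  # stemDocument() relies on the SnowballC package

# Hypothetical word list (not the real file content)
words <- c("said", "say", "says", "make", "made")

# Stem each word with the English Snowball stemmer
stemmed <- stemDocument(words, language = "english")
print(stemmed)

# Forms that were not reduced to the intended base forms
print(stemmed[!stemmed %in% c("say", "make")])
```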
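If some package shipped a dataset of irregular English verbs, this task would be easy. I do not know of any package that contains such data, so I chose to build my own database by scraping. I am not sure whether this website covers every irregular verb. If necessary, you may want to look for a better website to build your own database. Once you have the database, you can get to work on your task.

First, I used stemDocument to clean up present-tense forms ending in -s. Then I collected the past forms that occur in the words (past) and the infinitives of those past forms (inf1), and identified the order of the past forms in temp (ind). Next, I identified the positions of the past forms in temp (position). Finally, I replaced the past forms with their infinitives. I repeated the same steps for the past participles.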
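The index bookkeeping described above can be followed on a toy example in base R. This is only a sketch: the small mydic table here is hand-written for illustration and stands in for the scraped database, and temp is assumed to hold words that have already been stemmed.

```r
# Toy stand-in for the scraped table (hand-written, not taken from the website)
mydic <- data.frame(
  Infinitive = c("say", "make", "draw"),
  Past       = c("said", "made", "drew"),
  stringsAsFactors = FALSE
)

# Words assumed to be already stemmed
temp <- c("made", "say", "said", "make")

in_temp  <- mydic$Past %in% temp
past     <- mydic$Past[in_temp]        # past forms that occur in temp
inf1     <- mydic$Infinitive[in_temp]  # their infinitives, in dictionary order
ind      <- match(temp, past)          # dictionary index for each word, NA if none
ind      <- ind[!is.na(ind)]           # keep only the matched words
position <- which(temp %in% past)      # where the past forms sit in temp

# Replace each past form with its infinitive
temp[position] <- inf1[ind]
print(temp)
```

Here past is in dictionary order ("said", "made") while the words appear in temp in the opposite order, which is why ind is needed to pick the right infinitive for each position.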
library(tm)
library(rvest)
library(dplyr)
library(splitstackshape)

### Create a database
x <- read_html("http://www.englishpage.com/irregularverbs/irregularverbs.html")

x %>%
  html_table(header = TRUE) %>%
  bind_rows() %>%
  rename(Past = `Simple Past`, PP = `Past Participle`) %>%
  ### Drop the single-letter section rows (A, B, C, ...)
  filter(!Infinitive %in% LETTERS) %>%
  ### One row per alternative form, e.g. "a / b" becomes two rows
  cSplit(splitCols = c("Past", "PP"),
         sep = " / ", direction = "long") %>%
  filter(complete.cases(.)) %>%
  ### Strip trailing "(...)" notes and "[?]" markers from every column
  mutate(across(everything(),
                ~ gsub(pattern = "\\s\\(.*\\)$|\\s\\[\\?\\]",
                       replacement = "",
                       x = .x))) -> mydic
### Work on the task
words <- c("said", "drawn", "say", "says", "make", "made", "done")
### stemDocument() strips the -s ending (says -> say)
temp <- stemDocument(words)
### Replace past forms with their infinitives
### Collect past forms
past <- mydic$Past[which(mydic$Past %in% temp)]
### Collect infinitive forms of past forms
inf1 <- mydic$Infinitive[which(mydic$Past %in% temp)]
### Identify the order of past forms in temp
ind <- match(temp, past)
ind <- ind[is.na(ind) == FALSE]
### Where are the past forms in temp?
position <- which(temp %in% past)
temp[position] <- inf1[ind]
### Check
temp
#[1] "say" "drawn" "say" "say" "make" "make" "done"
### PP forms to infinitive forms (same as past forms)
pp <- mydic$PP[which(mydic$PP %in% temp)]
inf2 <- mydic$Infinitive[which(mydic$PP %in% temp)]
ind <- match(temp, pp)
ind <- ind[is.na(ind) == FALSE]
position <- which(temp %in% pp)
temp[position] <- inf2[ind]
### Check
temp
#[1] "say" "draw" "say" "say" "make" "make" "do"
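As a possible simplification, the two passes (past forms, then past participles) could be folded into one lookup with a named character vector. The sketch below is not the code used above: the irregular table is a small hand-written stand-in for mydic, the helper to_infinitive is a hypothetical name, and the input is assumed to have already gone through stemDocument.

```r
# Hypothetical hand-written table standing in for the scraped mydic
irregular <- data.frame(
  Infinitive = c("say", "draw", "make", "do"),
  Past       = c("said", "drew", "made", "did"),
  PP         = c("said", "drawn", "made", "done"),
  stringsAsFactors = FALSE
)

# One named vector: inflected form -> infinitive
lookup <- c(setNames(irregular$Infinitive, irregular$Past),
            setNames(irregular$Infinitive, irregular$PP))
lookup <- lookup[!duplicated(names(lookup))]  # e.g. "said" is both Past and PP

# Hypothetical helper: replace every word found in the lookup, keep the rest
to_infinitive <- function(words, lookup) {
  hit <- words %in% names(lookup)
  words[hit] <- unname(lookup[words[hit]])
  words
}

# Words assumed to be already stemmed ("says" has become "say")
words <- c("said", "drawn", "say", "say", "make", "made", "done")
result <- to_infinitive(words, lookup)
print(result)
```

Indexing by name avoids the separate ind and position vectors, at the cost of assuming that each inflected form maps to a single infinitive.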