R: Stemming with the 'hunspell' dictionary
I used the following custom stemming function, taken from another source:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
Then I ran the following:
library(qdap)
library(tm)
sentences <- iconv(sentences, "latin1", "ASCII", sub="")
sentences <- gsub('http\\S+\\s*', '', sentences)
sentences <- bracketX(sentences,bracket='all')
sentences <- gsub("[[:punct:]]", "",sentences)
sentences <- removeNumbers(sentences)
sentences <- tolower(sentences)
# Stemming
library(corpus)
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
sentences <- lapply(sentences, removeWords, stopwords('en'))
sentences <- lapply(sentences, stripWhitespace)
The result is:
[[1]]
[1] "" "taking" "active" "step" "" "tackle"
[[2]]
[1] "" "numb" "" "measure" "" "" "taking" ""
[9] "support"
[[3]]
[1] "" "caught" "" "committing" "" "decent"
[7] "act"
For example, why do "committing" and "taking" still appear in their original -ing form? And why did "number" become "numb"?
I think the answer is mostly that this is simply how hunspell stems these words. We can check this with a simpler example:
hunspell::hunspell_stem("taking")
#> [[1]]
#> [1] "taking"
hunspell::hunspell_stem("committing")
#> [[1]]
#> [1] "committing"
The -ing form is the only option hunspell offers. That does not make much sense to me either, so my suggestion is to use a different stemmer. While we are at it, I think you would also benefit from switching from tm to quanteda:
library(quanteda)
sentences <- c("We're taking proactive steps to tackle ...",
               "A number of measures we are taking to support ...",
               "We caught him committing an indecent act.")
tokens(sentences, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_wordstem()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we'r" "take" "proactiv" "step" "to" "tackl" "."
#> [8] "." "."
#>
#> text2 :
#> [1] "a" "number" "of" "measur" "we" "are" "take"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "commit" "an" "indec" "act" "."
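The question's pipeline also removed punctuation and English stopwords, which the snippet above leaves in. A minimal sketch of how those steps might look in quanteda is below. It assumes the `stopwords` package is installed (it is normally pulled in alongside quanteda) and uses intermediate assignments instead of a pipe. The choice to remove stopwords before stemming is mine, not something stated in the answer.

```r
library(quanteda)

# Example sentences taken from the answer above
sentences <- c("We're taking proactive steps to tackle ...",
               "A number of measures we are taking to support ...",
               "We caught him committing an indecent act.")

# Tokenize, dropping punctuation and numbers at this stage
toks <- tokens(sentences, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Remove English stopwords before stemming, so that the stopword
# list is matched against whole words rather than stems
toks <- tokens_remove(toks, stopwords::stopwords("en"))

# Snowball stemming (the default stemmer behind tokens_wordstem)
toks <- tokens_wordstem(toks)

# Inspect the result as a plain list of character vectors
as.list(toks)
```

Because stopwords are dropped from the tokens object rather than blanked out, this avoids the empty strings that `removeWords()` left in the question's output.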
# hunspell-based stemmer that works on a quanteda tokens object
stem_hunspell <- function(toks) {
  # look up each unique token type in the dictionary and keep the first stem
  stems <- vapply(hunspell::hunspell_stem(types(toks)), "[", 1, FUN.VALUE = character(1))
  # if there is no stem (NA or empty string), use the original term
  missing <- is.na(stems) | nchar(stems) == 0
  stems[missing] <- types(toks)[missing]
  tokens_replace(toks, types(toks), stems, valuetype = "fixed")
}
tokens(sentences, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  stem_hunspell()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we're" "taking" "active" "step" "to" "tackle" "." "."
#> [9] "."
#>
#> text2 :
#> [1] "a" "number" "of" "measure" "we" "are" "taking"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "committing" "an"
#> [6] "decent" "act" "."
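Note that the function above keeps the first candidate stem, whereas the function in the question keeps the last one. That difference may be why "number" came out as "numb" in the question but stays "number" here. The sketch below is one way to check this by listing every candidate hunspell returns and comparing the two choices. `pick_stem` and its `keep` argument are hypothetical names introduced for illustration and are not part of hunspell.

```r
library(hunspell)

# Hypothetical helper (not part of hunspell): choose which candidate
# stem to keep for each term, falling back to the term itself.
pick_stem <- function(terms, keep = c("first", "last")) {
  keep <- match.arg(keep)
  candidates <- hunspell::hunspell_stem(terms)
  vapply(seq_along(terms), function(i) {
    stems <- candidates[[i]]
    if (length(stems) == 0) {
      return(terms[[i]])                # no stem found: keep the original term
    }
    if (keep == "first") stems[[1]] else stems[[length(stems)]]
  }, FUN.VALUE = character(1))
}

words <- c("number", "taking", "committing")

# List every candidate stem hunspell offers for each word
hunspell::hunspell_stem(words)

# Compare the two selection rules side by side
data.frame(word  = words,
           first = pick_stem(words, "first"),
           last  = pick_stem(words, "last"))
```

If the two columns differ for a word, the dictionary offers more than one stem for it, and the selection rule decides which one ends up in the output.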