Calling stemCompletion and PlainTextDocument corrupts text in R
Given a text corpus, I want to normalize terms in R using stemming and stem completion. However, the stemCompletion step has problems in the 0.6.x versions of the tm package. I am using R 3.3.1 and tm 0.6-2. This question has been asked before, but I have not yet seen a complete answer that actually works. Here is complete code that demonstrates the problem:
require(tm)
txt <- c("Once we have a corpus we typically want to modify the documents in it",
"e.g., stemming, stopword removal, et cetera.",
"In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus
# *Removing common word endings* (e.g., "ing", "es")
myCorpus <- tm_map(myCorpus, stemDocument, language = "english")
# Next, we remove all the empty spaces generated by isolating the
# word stems in the previous step.
myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)
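At this stage, after stemCompletion has been applied through tm_map, the corpus no longer consists of TextDocument objects, and creating the TermDocumentMatrix fails with the error: inherits(doc, "TextDocument") is not TRUE. The documented next step is to apply the PlainTextDocument function: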
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)
Here is the output:
<<TermDocumentMatrix (terms: 19, documents: 2)>>
Non-/sparse entries: 20/18
Sparsity : 47%
Maximal term length: 9
Weighting : term frequency (tf)
[1] "all" "cetera" "concept" "corpus" "document"
[6] "function" "have" "into" "modifi" "onc"
[11] "remov" "stem" "stopword" "subsum" "the"
[16] "this" "transform" "typic" "want"
<<TermDocumentMatrix (terms: 2, documents: 2)>>
Non-/sparse entries: 4/0
Sparsity : 0%
Maximal term length: 7
Weighting : term frequency (tf)
[1] "content" "meta"
Calling PlainTextDocument has corrupted the corpus.
I expected the stemmed words to be completed, e.g. modifi => modifier, onc => once, etc.

Calling PlainTextDocument did not corrupt the corpus.
You may have noticed that when you ran this line
myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)
you received several warning messages:
It is worth mentioning that,
here is how to do the stem completion with your data:
txt <- c("Once we have a corpus we typically want to modify the documents in it",
"e.g., stemming, stopword removal, et cetera.",
"In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
tdm <- TermDocumentMatrix(myCorpus, control = list(stemming = TRUE))
cbind(stems = rownames(tdm), completed = stemCompletion(rownames(tdm), myCorpus))
To write the changes back to tdm permanently, do the following:
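# Complete each stem against the dictionary and return the result as a single document.
# The dictionary has no default; the original default (dictCorpus) was never defined.
stemCompletion_mod <- function(x, dict) {
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, sep = "", collapse = " ")))
}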
tdm <- stemCompletion_mod(rownames(tdm), myCorpus)
tdm$content
[1] all cetera concept corpus documents functionality have into NA
once removal stemming stopword subsumed the this transformation
typically want
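The output above is a single document of completed terms rather than a corpus. If the goal is a corpus that still yields a TermDocumentMatrix after completion, one option is to wrap the completion in content_transformer so that each document stays a text document. The following is only a sketch under stated assumptions: it uses VCorpus explicitly so that tm_map works document by document, it needs the SnowballC package for stemDocument, and `complete_text` is a hypothetical helper name, not part of tm. It keeps the stem itself wherever no completion is found.

```r
library(tm)

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")

# Assumption: VCorpus, so that tm_map() applies the function to one document at a time.
corp <- VCorpus(VectorSource(txt))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
dict <- corp  # unstemmed copy, used as the completion dictionary
corp <- tm_map(corp, stemDocument, language = "english")

# Hypothetical helper: complete every stem in one document's text and
# keep the stem itself when stemCompletion() finds no match.
complete_text <- function(x, dictionary) {
  stems <- unlist(strsplit(stripWhitespace(paste(x, collapse = " ")), " "))
  stems <- stems[nzchar(stems)]
  done <- unname(stemCompletion(stems, dictionary = dictionary, type = "shortest"))
  miss <- is.na(done) | done == ""
  done[miss] <- stems[miss]
  paste(done, collapse = " ")
}

# content_transformer() writes the result back into each document,
# so the corpus still contains text documents afterwards.
corp <- tm_map(corp, content_transformer(complete_text), dictionary = dict)
tdm <- TermDocumentMatrix(corp, control = list(wordLengths = c(3, Inf)))
print(Terms(tdm))
```

Because the completion happens inside the documents, anything built from `tdm` afterwards (for example a word cloud) sees completed terms where a completion exists.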
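Regarding Hack-R's solution: I had the same problem as Jason. I wanted the stem-completed words for a word cloud and as part of the TDM. Since stemCompletion does not return a TDM, I extracted the terms from the TDM and ran stemCompletion on them. While testing, I broke them out into a separate variable.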
require(tm)
txt <- c("Once we have a corpus we typically want to modify the documents in it",
"e.g., stemming, stopword removal, et cetera.",
"In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus
# *Removing common word endings* (e.g., "ing", "es")
myCorpus <- tm_map(myCorpus, stemDocument, language = "english")
# Next, we remove all the empty spaces generated by isolating the
# word stems in the previous step.
myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)
Since stemCompletion appears to return a named character vector, I replaced the Terms part of tdm with the stem-completed version:
tdm$dimnames$Terms <- as.character(stemCompletion(tdm$dimnames$Terms, myCorpusCopy, type = "prevalent"))
print(tdm$dimnames$Terms)
You will get blank entries for words it evidently does not know how to complete, such as modifi, but at least this time you can use the stem-completed version.
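One way to avoid the blank entries is to fall back to the original stem whenever no completion is found. The sketch below uses base R only; `complete_or_keep` is a hypothetical helper name, and the completion results are made up for illustration rather than taken from tm.

```r
# Hypothetical helper (not part of tm): fall back to the stem itself
# wherever completion returned NA or an empty string.
complete_or_keep <- function(stems, completed) {
  completed <- unname(as.character(completed))
  missing <- is.na(completed) | completed == ""
  completed[missing] <- stems[missing]
  completed
}

# Made-up completion results, for illustration only.
stems <- c("modifi", "onc", "remov")
completed <- c(modifi = "", onc = "once", remov = NA)
fixed <- complete_or_keep(stems, completed)
print(fixed)
```

With tm loaded, the same helper could presumably be applied to the term names, for example `tdm$dimnames$Terms <- complete_or_keep(Terms(tdm), stemCompletion(Terms(tdm), myCorpusCopy, type = "prevalent"))`, so that no term is left empty.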
Comments:
Possible duplicate of
As mentioned in the question, this has been asked before, but I have not yet seen a complete answer. More often the question is not fully self-contained, e.g. it loads text files. I found that question, and its answers did not work for me.
I guess it is because of the earlier errors that you received the 3 warning messages. When I follow my updated answer below, it works on the example. Hope this helps, cheers.
Getting the list of stem-completed words is useful, but at this point the TermDocumentMatrix still contains the uncompleted words, so using the tdm in wordcloud or other packages still shows the uncompleted words.
@JasonM1 OK, I updated it to write the changes back to the tdm. I got the function from here:
Thanks for the answer @Hack-R, but if wordcloud is used with the tdm after the steps above, the word cloud still shows the uncompleted terms. Also, when I run the code above, there is an empty string where modify should be listed for modifi. Using R 3.3.1 with tm 0.6-2 on Windows.