Calling stemCompletion and PlainTextDocument corrupts text in R

Tags: r, data-manipulation

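Given a text corpus, I want to normalize terms in R using stemming and stem completion. However, the stemCompletion step has problems in the 0.6.x versions of the tm package. I am using R 3.3.1 and tm 0.6-2.

This question has been asked before, but I have not yet seen a complete answer that actually works. Below is complete code that demonstrates the problem: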

 require(tm)
 txt <- c("Once we have a corpus we typically want to modify the documents in it",
          "e.g., stemming, stopword removal, et cetera.",
          "In tm, all this functionality is subsumed into the concept of a transformation.")

 myCorpus <- Corpus(VectorSource(txt))

 myCorpus <- tm_map(myCorpus, content_transformer(tolower))
 myCorpus <- tm_map(myCorpus, removePunctuation)
 myCorpusCopy <- myCorpus

 # *Removing common word endings* (e.g., "ing", "es") 
 myCorpus <- tm_map(myCorpus, stemDocument, language = "english")

 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
 myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))

 tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
 print(tdm)
 print(dimnames(tdm)$Terms)

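At this stage (after applying stemCompletion through tm_map), the corpus is no longer a TextDocument, and creating the TermDocumentMatrix fails with the error: inherits(doc, "TextDocument") is not TRUE. The documented next step is to apply the PlainTextDocument function: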

myCorpus <- tm_map(myCorpus, PlainTextDocument)

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)

Here is the output:

<<TermDocumentMatrix (terms: 19, documents: 2)>>
Non-/sparse entries: 20/18
Sparsity           : 47%
Maximal term length: 9
Weighting          : term frequency (tf)
 [1] "all"       "cetera"    "concept"   "corpus"    "document" 
 [6] "function"  "have"      "into"      "modifi"    "onc"      
[11] "remov"     "stem"      "stopword"  "subsum"    "the"      
[16] "this"      "transform" "typic"     "want"

<<TermDocumentMatrix (terms: 2, documents: 2)>>
Non-/sparse entries: 4/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)
[1] "content" "meta"

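Calling PlainTextDocument has corrupted the corpus: the only terms left are "content" and "meta".

Expected: the stemmed words are completed, e.g., modifi => modifier, onc => once, etc.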

Calling PlainTextDocument does not corrupt the corpus.

You may have noticed that when you ran this line

myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

you received several warning messages.

For what it's worth, here is how to run stem completion on the stems using your data:

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
tdm      <- TermDocumentMatrix(myCorpus, control = list(stemming = TRUE)) 
cbind(stems = rownames(tdm), completed = stemCompletion(rownames(tdm), myCorpus))
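To make the two-column comparison above easier to read, here is a minimal base-R sketch of the idea behind stem completion: each stem is matched as a prefix against a dictionary of full words. This is a simplified illustration, not tm's actual implementation; the helper name `complete_stems` and the small dictionary are made up for the example. It also suggests why a stem such as "modifi" can stay uncompleted: the dictionary word "modify" does not begin with "modifi".

```r
# Simplified illustration of prefix-based stem completion (hypothetical helper,
# not the code used inside tm::stemCompletion).
complete_stems <- function(stems, dictionary) {
  vapply(stems, function(stem) {
    # Dictionary words that start with the stem (literal prefix match).
    candidates <- dictionary[startsWith(dictionary, stem)]
    if (length(candidates) == 0) return(NA_character_)
    # Choose the shortest candidate, similar in spirit to type = "shortest".
    candidates[which.min(nchar(candidates))]
  }, character(1))
}

dictionary <- c("once", "modify", "documents", "stemming", "removal")
stems      <- c("onc", "modifi", "document", "stem", "remov")

print(complete_stems(stems, dictionary))
```

Stems with no matching dictionary word come back as NA in this sketch, so any code that consumes the result should decide what to do with them.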

To write the changes back permanently, do the following:

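# Complete each stem against the dictionary and return the result as a single
# PlainTextDocument (note: this is a document, not a TermDocumentMatrix).
stemCompletion_mod <- function(x, dict) {
  stems     <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}

tdm <- stemCompletion_mod(rownames(tdm), myCorpus)

tdm$content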

[1] "all cetera concept corpus documents functionality have into NA once removal stemming stopword subsumed the this transformation typically want"
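The result above is a single document rather than a term-document matrix. As an alternative, the sketch below completes the stems inside each document and only then builds the matrix, so the corpus keeps its document structure. It rests on several assumptions: `complete_text` is a hypothetical helper, `VCorpus` is used so that `tm_map` processes one document at a time, the dictionary is passed as a plain character vector, and `stemDocument` needs the SnowballC package installed. Stems without a completion are left as they are.

```r
library(tm)  # stemDocument() additionally requires the SnowballC package

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")

# VCorpus keeps each text as its own PlainTextDocument.
corpus <- VCorpus(VectorSource(txt))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Dictionary of unstemmed words, taken before stemming.
clean <- unlist(lapply(corpus, as.character))
dict  <- unique(unlist(strsplit(clean, "\\s+")))
dict  <- dict[nzchar(dict)]

corpus <- tm_map(corpus, stemDocument, language = "english")

# Hypothetical helper: complete every stem in one text, keeping the stem
# itself when stemCompletion() finds nothing (NA or empty string).
complete_text <- function(x, dictionary) {
  stems <- unlist(strsplit(x, "\\s+"))
  stems <- stems[nzchar(stems)]
  completed <- unname(stemCompletion(stems, dictionary = dictionary,
                                     type = "shortest"))
  unresolved <- is.na(completed) | !nzchar(completed)
  completed[unresolved] <- stems[unresolved]
  paste(completed, collapse = " ")
}

# content_transformer() keeps each element a text document.
corpus <- tm_map(corpus, content_transformer(complete_text), dictionary = dict)

tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf)))
print(Terms(tdm))
```

Because completion is a heuristic prefix match, the completed word is not guaranteed to be the word that originally appeared in the text.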


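Regarding Hack-R's solution: I had the same problem as Jason. I wanted to use the stem-completed words in a word cloud and as part of the TDM.

Since stemCompletion does not return a TDM, I extracted the terms from the TDM and then ran stemCompletion on them.

While testing, I broke them out into a separate variable.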

require(tm)
txt <- c("Once we have a corpus we typically want to modify the documents in it",
      "e.g., stemming, stopword removal, et cetera.",
      "In tm, all this functionality is subsumed into the concept of a transformation.")

myCorpus <- Corpus(VectorSource(txt))

myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus

 # *Removing common word endings* (e.g., "ing", "es") 
myCorpus <- tm_map(myCorpus, stemDocument, language = "english")

 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)

Since stemCompletion appears to return a character vector, I replaced the Terms part of "tdm" with the stem-completed version:

tdm$dimnames$Terms <- as.character(stemCompletion(tdm$dimnames$Terms, myCorpusCopy, type = "prevalent"))
print(tdm$dimnames$Terms)

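You will get blank fields for the words it evidently does not know how to complete, such as modifi, but at least this time you can use the stem-completed version…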
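One way to deal with those blank fields is to fall back to the original stem, and to merge rows when two stems end up with the same completed term. The sketch below uses base R only, with a small made-up count matrix standing in for `as.matrix(tdm)`; the stems, completions, and counts are illustrative values, not output from the code above.

```r
# Stems as they might appear in a TDM, with hypothetical completions:
# "modifi" could not be completed, and two stems map to the same word.
stems     <- c("modifi", "onc", "remov", "removal")
completed <- c("", "once", "removal", "removal")

# Keep the original stem wherever completion returned "" or NA.
unresolved  <- is.na(completed) | !nzchar(completed)
final_terms <- ifelse(unresolved, stems, completed)

# Toy term-by-document counts standing in for as.matrix(tdm).
counts <- matrix(c(1, 1, 0, 0,
                   0, 0, 1, 2),
                 nrow = 4,
                 dimnames = list(stems, c("doc1", "doc2")))

# Sum the rows that now share a term so each term appears exactly once.
merged <- rowsum(counts, group = final_terms)
print(merged)

# Per-term totals, e.g. as frequencies for a word cloud.
print(rowSums(merged))
```

Merging matters because renaming terms in place can leave duplicate row names, which downstream tools may count separately.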


Comments:

Regarding the possible duplicate mentioned in the question: this has been asked before, but I have not yet seen a complete answer. More often, the earlier question was not fully self-contained, e.g., it loaded a text file. When I found that question, its answers did not work for me.

I guess it is because of the earlier error; you received 3 warning messages. When I follow the updated answer below, it works for my example. Hope this helps, cheers.

Getting the list of stem-completed words is useful, but at this point the TermDocumentMatrix still contains the uncompleted stems, and using the tdm in wordcloud or other packages still shows the uncompleted stems.

@JasonM1 OK, I updated it to write the changes back to the tdm. I got this function from here:

Thanks @Hack-R for the answer, but if you use wordcloud with the tdm after the steps above, the wordcloud still shows the uncompleted terms. Also, when I run the code above, there is an empty string where modify should be listed for modifi. Using R 3.3.1 with tm 0.6-2 on Windows.