R error: only works on character objects


I am using the R programming language. I am trying to replicate a previous Stack Overflow post, with the goal of tokenizing the text and removing stop words.

Using some publicly available Shakespeare plays, I created a term-document matrix from 3 of the plays:

#load libraries
library(dplyr)
library(pdftools)
library(tidytext)
library(textrank)
library(tm)

#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)

#split the PDF pages into sentences
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)

#split the sentences into individual words
article_words <- article_sentences %>%
  unnest_tokens(word, sentence)

#remove stop words
article_words_1 <- article_words %>%
  anti_join(stop_words, by = "word")

#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_2<- article_words %>%
  anti_join(stop_words, by = "word")


#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_3 <- article_words %>%
  anti_join(stop_words, by = "word")
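Since the same pipeline runs for each play, it can be wrapped in a small helper function. This is only a sketch of the same steps as above, assuming the same Folger URLs:

#helper: download a play, split into sentences then words, drop stop words
tokenize_play <- function(url) {
  article <- pdf_text(url)
  tibble(text = article) %>%
    unnest_tokens(sentence, text, token = "sentences") %>%
    mutate(sentence_id = row_number()) %>%
    unnest_tokens(word, sentence) %>%
    anti_join(stop_words, by = "word")
}

article_words_1 <- tokenize_play("https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf")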
From here, I created the actual term-document matrix:

library(tm)

#create term document matrix
tdm <- TermDocumentMatrix(Corpus(VectorSource(rbind(article_words_1, article_words_2, article_words_3))))

#inspect the "term document matrix" (I don't know why this is producing an error)
inspect(tdm)
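The odd inspect() output is likely because VectorSource() treats the rbind() result, a data frame, as a list of columns, so each column rather than each play becomes a "document". A minimal sketch of an alternative construction, using the cast_* functionality from tidytext mentioned in the comments below (the play labels here are my own illustrative names):

library(dplyr)
library(tidytext)

#label each play, count words per play, then cast to a term-document matrix
tdm <- bind_rows(hamlet  = article_words_1,
                 macbeth = article_words_2,
                 othello = article_words_3,
                 .id = "document") %>%
  count(document, word) %>%
  cast_tdm(word, document, n)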
After this, I tried to tokenize and remove the stop words using two different methods:

library(quanteda)

#first method:

first_method <- tokens(tdm) %>%
  tokens_remove(stopwords("en"), pad = TRUE)

Error: tokens() only works on character, corpus, list, tokens objects.


#second method:

second_method <- dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))

Error: dfm() only works on character, corpus, list, tokens objects.

Both attempts produce errors stating that these functions only work on character, corpus, list, or tokens objects. Is there a way to use these functions on the term-document matrix I created?

Thanks!

Comments:

- What is your goal? article_words_1 is already tokenized and the stop words are removed. tidytext has functionality to convert data.frames into a dtm, tdm, or dfm with the cast_xx commands.

- @phiver: Thank you for your reply! This was just a Shakespeare example I came up with. My real data is not tokenized and still contains stop words, and it is already a term-document matrix. Is there a way to tokenize and remove stop words within a term-document matrix? Thanks!

- Does anyone know how to convert it into a corpus object?

- You can't convert a tdm back into a corpus, because the text has already been split and the features have been counted. But I understand what you want to achieve. I will answer the new question you posted with an example. Please close/delete this one.
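Following the cast_xx pointer in the comments, one possible way to strip stop words from an existing TermDocumentMatrix is to tidy it into a data frame, anti-join the stop words, and cast it back. A minimal sketch, assuming the tdm built above:

library(dplyr)
library(tidytext)

#tidy() turns the TDM into one row per term/document/count
tdm_clean <- tidy(tdm) %>%
  anti_join(stop_words, by = c("term" = "word")) %>%  #drop stop words by term
  cast_tdm(term, document, count)                     #cast back to a TDM

inspect(tdm_clean)

Note that re-tokenizing a TDM this way is not possible: as the comments point out, the text has already been split into features and counted, so only filtering (such as removing stop-word terms) can still be done at this stage.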