R error: only works on character objects


I am using the R programming language. I am trying to replicate a previous Stack Overflow post, with the goal of tokenizing the text and removing stop words.

Using some publicly available Shakespeare plays, I created a term-document matrix from 3 of the plays:

#load libraries
library(dplyr)
library(pdftools)
library(tidytext)
library(textrank)
library(tm)

#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)

#split the PDF pages into sentences
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)

#split the sentences into individual words
article_words <- article_sentences %>%
  unnest_tokens(word, sentence)

#remove stop words
article_words_1 <- article_words %>%
  anti_join(stop_words, by = "word")

#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_2<- article_words %>%
  anti_join(stop_words, by = "word")


#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_3 <- article_words %>%
  anti_join(stop_words, by = "word")
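Since the same pipeline runs for each play, it can be wrapped in a small helper function. This is only a sketch of the same steps as above, assuming the same Folger URLs:

#helper: download a play, split into sentences then words, drop stop words
tokenize_play <- function(url) {
  article <- pdf_text(url)
  tibble(text = article) %>%
    unnest_tokens(sentence, text, token = "sentences") %>%
    mutate(sentence_id = row_number()) %>%
    unnest_tokens(word, sentence) %>%
    anti_join(stop_words, by = "word")
}

article_words_1 <- tokenize_play("https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf")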
From here, I created the actual term-document matrix:

library(tm)

#create term document matrix
tdm <- TermDocumentMatrix(Corpus(VectorSource(rbind(article_words_1, article_words_2, article_words_3))))

#inspect the "term document matrix" (I don't know why this is producing an error)
inspect(tdm)
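The odd inspect() output is likely because VectorSource() treats the rbind() result, a data frame, as a list of columns, so each column rather than each play becomes a "document". A minimal sketch of an alternative construction, using the cast_* functionality from tidytext mentioned in the comments below (the play labels here are my own illustrative names):

library(dplyr)
library(tidytext)

#label each play, count words per play, then cast to a term-document matrix
tdm <- bind_rows(hamlet  = article_words_1,
                 macbeth = article_words_2,
                 othello = article_words_3,
                 .id = "document") %>%
  count(document, word) %>%
  cast_tdm(word, document, n)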
After this, I tried to tokenize and remove the stop words using two different methods:

library(quanteda)

#first method:

first_method <- tokens(tdm) %>%
  tokens_remove(stopwords("en"), pad = TRUE)

Error: tokens() only works on character, corpus, list, tokens objects.


#second method:

second_method <- dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))

Error: dfm() only works on character, corpus, list, tokens objects.

Both attempts produce errors stating that these functions only work on character, corpus, list, or tokens objects. Is there a way to use these functions on the term-document matrix I created?

Thanks!

Comments:

- What is your goal? article_words_1 is already tokenized and the stop words are removed. tidytext has functionality to convert data.frames into a dtm, tdm, or dfm with the cast_xx commands.

- @phiver: Thank you for your reply! This was just a Shakespeare example I came up with. My real data is not tokenized and still contains stop words, and it is already a term-document matrix. Is there a way to tokenize and remove stop words within a term-document matrix? Thanks!

- Does anyone know how to convert it into a corpus object?

- You can't convert a tdm back into a corpus, because the text has already been split and the features have been counted. But I understand what you want to achieve. I will answer the new question you posted with an example. Please close/delete this one.
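Following the cast_xx pointer in the comments, one possible way to strip stop words from an existing TermDocumentMatrix is to tidy it into a data frame, anti-join the stop words, and cast it back. A minimal sketch, assuming the tdm built above:

library(dplyr)
library(tidytext)

#tidy() turns the TDM into one row per term/document/count
tdm_clean <- tidy(tdm) %>%
  anti_join(stop_words, by = c("term" = "word")) %>%  #drop stop words by term
  cast_tdm(term, document, count)                     #cast back to a TDM

inspect(tdm_clean)

Note that re-tokenizing a TDM this way is not possible: as the comments point out, the text has already been split into features and counted, so only filtering (such as removing stop-word terms) can still be done at this stage.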