R text mining PDF files with word frequency: problems

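I am trying to mine an article in PDF format that contains rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words I get are phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but with others I get these random Greek letters. Is this a character encoding problem? (By the way, all the documents are in English.) Any suggestions?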

# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
 pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                              language = "en",
                                              id = "id1")
 content(pdf)[1:4]
 }


docs<- Corpus(URISource(uri, mode = ""),
    readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)  
docs <- tm_map(docs, tolower) 
docs <- tm_map(docs, removeWords, stopwords("english")) 

library(SnowballC)   
docs <- tm_map(docs, stemDocument)  
docs <- tm_map(docs, stripWhitespace) 
docs <- tm_map(docs, PlainTextDocument)  

dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs) 
freq <- colSums(as.matrix(dtm))   
length(freq)  
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)] 
freq[tail(ord)]

I think ghostscript is causing all the trouble here. Assuming pdfinfo and pdftotext are properly installed, this code does not produce the odd words you mentioned:
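library(tm)
library(SnowballC)  # stemmer backend used by stemDocument()

uri <- c("2012.pdf")

# Read the PDF with the xpdf tools (pdftotext/pdfinfo), keeping the page layout
pdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))(elem = list(uri = uri),
                                               language = "en",
                                               id = "id1")

# Build the corpus from the extracted text instead of re-reading the file with ghostscript
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
# content_transformer() keeps the documents as text documents,
# so no PlainTextDocument conversion is needed afterwards
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)

dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))  # total count of each term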


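Obviously, this result is not perfect, mainly because stemming hardly ever gives 100% reliable results (for example, we still have "issue" and "issues" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.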
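
If a few variants survive stemming, one possible follow-up is to merge their counts by hand. The sketch below is only an illustration in base R: the toy `freq` vector, its counts, and the `merge_map` lookup are all made up for the example and are not part of the answer's code.

```r
# Toy frequency vector standing in for `freq`; the counts are invented.
freq <- c(issue = 12, issues = 7, method = 9, methods = 4, model = 15)

# Hypothetical hand-written lookup: variant -> term it should be counted under.
merge_map <- c(issues = "issue", methods = "method")

# Replace each listed variant by its target term; other terms keep their name.
canonical <- names(freq)
hit <- canonical %in% names(merge_map)
canonical[hit] <- merge_map[canonical[hit]]

# Sum the counts of all terms that now share a name, then sort by frequency.
merged <- vapply(split(freq, canonical), sum, numeric(1))
merged <- sort(merged, decreasing = TRUE)
print(merged)
```

This keeps the stemmer's output untouched and only fixes the handful of pairs you actually care about, which is usually easier to audit than swapping the stemming algorithm.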
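I made the changes, but I am still getting the Greek words delta, toe, etc. as high-frequency words.

As a workaround, you can use my_stopwords to remove the unwanted words.

That is the problem. Even the low-frequency words look like aaa, zutng zwu zwzuz zxanug! So we really need to figure out how the PDF is being read in the package.

I am a bit surprised that you use engine="ghostscript". The first lines show that you use the xpdf standard engine, pdftotext and pdfinfo, if they are available. Why specify ghostscript afterwards...? The variable pdf does not seem to be used anymore. I would probably use something like docs

I guess this is a problem with how ghostscript reads/interprets the PDF. @RHertel you are on the right track to solving this puzzle!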
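
One way to test the guess about ghostscript is to build the term frequencies once per reader and look at the terms that only one of them produces. The following is a minimal base R sketch under that assumption; `freq_ghostscript` and `freq_xpdf` are toy vectors with invented counts, standing in for the `freq` vector computed from each engine's output.

```r
# Toy term frequencies standing in for the result of each reader (invented counts).
freq_ghostscript <- c(phi = 40, sigma = 33, toe = 28, softwar = 21, model = 15)
freq_xpdf        <- c(softwar = 25, model = 18, defect = 12, method = 9)

# Terms that appear only when the text comes from the ghostscript reader.
only_ghostscript <- setdiff(names(freq_ghostscript), names(freq_xpdf))

# Show them with their counts; frequent terms here point at the reader,
# not at the article's actual wording.
print(freq_ghostscript[only_ghostscript])
```

If the suspicious words show up only on one side of the comparison, the extraction step is the place to look rather than the cleaning steps.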
library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))
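
The my_stopwords workaround mentioned in the comments could be sketched as follows. The example filters a frequency vector in base R; the `freq` values are invented, and the contents of `my_stopwords` are an assumption based on the words reported in the question. On a tm corpus, the same list could be passed to `removeWords` through `tm_map` before the word cloud is drawn.

```r
# Toy frequency vector standing in for `freq`; the counts are invented.
freq <- c(model = 15, phi = 40, sigma = 33, softwar = 21, toe = 28, gamma = 19)

# Hypothetical list of unwanted terms, taken from the words named in the question.
my_stopwords <- c("phi", "sigma", "gamma", "toe", "taeoe", "delta")

# Keep only the terms that are not in the unwanted list.
freq_clean <- freq[!(names(freq) %in% my_stopwords)]

# Most frequent remaining terms.
print(sort(freq_clean, decreasing = TRUE))
```

Note that this only hides the symptom: if the reader produces garbage tokens, the list has to be extended for every new document.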