
Creating a corpus from many HTML files in R


I want to create a corpus from a collection of downloaded HTML files and then read them into R for future text mining.

Basically, this is what I want to do:

  • Create a corpus from multiple HTML files
I tried using DirSource:

library(tm)
a<- DirSource("C:/test")
b<-Corpus(DirSource(a), readerControl=list(language="eng", reader=readPlain))

This will correct the error:

b <- Corpus(a,  ## pass a directly instead of DirSource(a)
            readerControl = list(language = "eng", reader = readPlain))
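
A quick sanity check on the result, as a minimal sketch assuming the b object from the corrected call above:

length(b)   # number of documents read from C:/test
inspect(b)  # print a summary of the corpus contents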

This reads all the HTML files into an R object you can work with:

# Set variables
folder <- 'C:/test'
extension <- '.htm'

# Get the names of *.html files in the folder
files <- list.files(path=folder, pattern=extension)

# Read all the files into a list
htmls <- lapply(X=files,
                FUN=function(file){
                 .con <- file(description=paste(folder, file, sep='/'))
                 .html <- readLines(.con)
                 close(.con)
                 names(.html)  <- file
                 .html
})
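
To go from that list to a tm corpus, one option is to collapse each file's lines into a single string and wrap it in a VectorSource. A minimal sketch assuming the htmls list built above (the documents still contain raw HTML tags at this point; the htmlToText approach below strips them out):

# collapse each file's lines into one string per document
texts <- sapply(htmls, paste, collapse = "\n")
# build the corpus from the character vector of documents
corpus <- Corpus(VectorSource(texts))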
This should do the trick. Here I have a folder on my machine containing HTML files (a random sample from SO); I made a corpus out of them, then a document-term matrix, and then ran a few simple text-mining tasks.

# get data
setwd("C:/Downloads/html") # this folder has your HTML files 
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files

# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))

# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))

# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
a <- tm_map(a, PlainTextDocument) # convert back to PlainTextDocument (needed for tm >= 0.6)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10))) 
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,] 
inspect(a.dtm2)

# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(as.matrix(a.dtm3)) # use as.matrix() to get the term counts as a plain matrix
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean") 
fit <- hclust(d, method = "ward.D") # "ward" was renamed to "ward.D" in newer versions of R
plot(fit)
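
As a small follow-up, the dendrogram can be cut into a fixed number of groups; a sketch assuming the fit object above, with k = 3 chosen arbitrarily:

groups <- cutree(fit, k = 3)  # assign each term to one of 3 clusters
table(groups)                 # how many terms fall into each cluster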

I found this package particularly useful for extracting just the "core" text of an HTML page.
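
The package itself isn't named above; as a rough alternative sketch, the XML package (already loaded in the answer) can pull just the paragraph text out of a page. The file name here is hypothetical:

library(XML)
doc <- htmlParse("some_page.html", encoding = "UTF-8")  # parse one downloaded page
paras <- xpathSApply(doc, "//p", xmlValue)              # keep only text inside <p> nodes
core <- paste(paras, collapse = " ")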

Try using backslashes instead of forward slashes in the DirSource call: C:\test
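
Note that inside an R string a literal backslash has to be escaped, so that suggestion would be written as shown here (the forward slashes used in the question also work on Windows):

a <- DirSource("C:\\test")  # doubled backslashes inside the string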
Which package are the Corpus and DirSource commands from?

Nice. Just add the argument pattern=".html" to list.files(...) and the folder can also hold other files (for example the R script that downloads the data, a README, and any other non-HTML files), except of course files that have "html" in their names.

To get this working with tm 0.6, convert your corpus to plain-text documents, otherwise you won't be able to create the TDM. Do the following:
a <- tm_map(a, PlainTextDocument)
The same answer also includes a quick word cloud of the most frequent terms (using the a.dtm1 matrix built above):

# just for fun... 
library(wordcloud)
library(RColorBrewer)

m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE) 
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
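
If the cloud gets too crowded, wordcloud's max.words argument caps it at the most frequent terms; a small variant of the call above:

# keep only the 100 most frequent words in the cloud
wordcloud(dm$word, dm$freq, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))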