XPath on each document in an R corpus

xml r xpath

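I have a corpus x in R, created from a directory using DirSource. Each document is a text file containing the full HTML of a page from a vBulletin forum. Because each page is a thread, every document contains several individual posts, which I want to capture with XPath. The XPath seems to work, but I can't put all of the captured nodes back into a corpus.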

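If my corpus has 25 documents with an average of 4 posts each, then my new corpus should have 100 documents. I'm wondering whether I need to write a loop and build a new corpus.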

Here is my messy work so far. The source of any thread on www.vbulletin.org/forum/ is an example of the structure.

#for stepping through
xt <- x[[5]]
xpath <- "//div[contains(@id,'post_message')]"

getxpath <- function(xt,xpath){
  require(XML)

  #either parse
  doc <- htmlParse(file=xt)
  #doc <- htmlTreeParse(tolower(xt), asText = TRUE, useInternalNodes = TRUE)

  #don't know which to use
  #result <- xpathApply(doc,xpath,xmlValue)
  result <- xpathSApply(doc,xpath,xmlValue)

  #clean up
  result <- gsub(pattern="\\s+",replacement=" ",x=gsub(pattern="\n|\t",replacement=" ",x=result))

  result <- c(result[1:length(result)])

  free(doc)

  #converts group of nodes into 1 data frame with numbers before separate posts
  #require(plyr)
  #xbythread <- ldply(.data=result,.fun=function(x){unlist(x)})

  #don't know what needs to be returned
  result <- Corpus(VectorSource(result))
  #result <- as.PlainTextDocument(result)

  return(result)
}

#call
x2 <- tm_map(x=x,FUN=getxpath,"//div[contains(@id,'post_message')]")

I figured this out a while ago. htmlParse needs isURL=TRUE.
getxpath <- function(xt,xpath){
  require(XML)
  # xt is a URL here, so htmlParse has to be told to fetch it
  doc <- htmlParse(file=xt,isURL=TRUE)
  resultvector <- xpathSApply(doc,xpath,xmlValue)
  free(doc)
  # collapse newlines, tabs and runs of whitespace into single spaces
  result <- gsub(pattern="\\s+",replacement=" ",x=resultvector)
  return(result)
}

res <- getxpath("http://url.com/board.html","//xpath")
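
For the original goal of one corpus document per post, a rough sketch follows. It assumes the HTML is already in memory as one string per thread, so it parses with asText=TRUE instead of isURL=TRUE. The names extract_posts, pages and post_corpus and the two inline HTML strings are made up for illustration and are not from the original code.

```r
library(XML)  # htmlParse, xpathSApply, xmlValue, free
library(tm)   # VCorpus, VectorSource

# Return the cleaned text of every node matching `xpath` in one HTML string.
extract_posts <- function(html, xpath) {
  doc <- htmlParse(html, asText = TRUE)
  on.exit(free(doc))
  posts <- xpathSApply(doc, xpath, xmlValue)
  # xpathSApply gives back an empty list when nothing matches
  if (length(posts) == 0) return(character(0))
  trimws(gsub("\\s+", " ", posts))
}

# Hypothetical stand-ins for two thread pages
pages <- c(
  "<html><body><div id='post_message_1'>First\n post</div><div id='post_message_2'>Second post</div></body></html>",
  "<html><body><div id='post_message_3'>Third\tpost</div><div id='sidebar'>ignored</div></body></html>"
)

xpath <- "//div[contains(@id,'post_message')]"

# One character vector per page, flattened into one element per post
posts <- unlist(lapply(pages, extract_posts, xpath = xpath), use.names = FALSE)

# Each post becomes its own document in the new corpus
post_corpus <- VCorpus(VectorSource(posts))
```

With an existing tm corpus x, each document would first have to be collapsed to a single string, for example with paste(as.character(x[[i]]), collapse="\n"), before being passed to extract_posts. Building a new corpus this way sidesteps tm_map, which transforms documents one-to-one and so cannot turn 25 documents into 100.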
