R: Using a loop to parse a group of HTML files

html xml r

The following code works for a single .html file:

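library(XML)  # provides htmlParse, xpathSApply and xmlValue

doc <- htmlParse("New folder/1-4.html")
# Collect the text of every table cell
plain.text <- xpathSApply(doc, "//td", xmlValue)
plain.text <- gsub("\n", "", plain.text)
# Locate each marker to work out which cell and offsets to extract
gregexpr("firstThing", plain.text)
firstThing <- substring(plain.text[9], 41, 50)
gregexpr("secondThing", plain.text)
secondThing <- substring(plain.text[7], 1, 550)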

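Two things. First, you got the path wrong: by default dir returns bare file names without the folder, so htmlParse cannot find the files. To fix this, use: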
filenames = dir(path = "New folder", full.names = TRUE)
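
As a rough illustration of the difference, the sketch below builds a throwaway folder in tempdir() and lists it both ways. The second file name (5-8.html) and the pattern argument are assumptions added for the example, not part of the original question.

```r
# Create a scratch folder holding two empty files (hypothetical names).
folder <- file.path(tempdir(), "New folder")
dir.create(folder, showWarnings = FALSE)
invisible(file.create(file.path(folder, c("1-4.html", "5-8.html"))))

# Default behaviour: bare file names, only meaningful relative to `folder`.
bare_names <- dir(path = folder)

# full.names = TRUE prepends the folder to each name;
# pattern (optional) restricts the listing to files ending in .html.
filenames <- dir(path = folder, pattern = "\\.html$", full.names = TRUE)

# The full paths can be opened from any working directory.
all(file.exists(filenames))
```

Filtering with pattern is optional, but it keeps stray non-HTML files in the folder out of the loop.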

Second, rather than filling two variables in a for loop, it is better to generate structured data with a list-apply function:
result = lapply(filenames, function (filename) {
    doc = htmlParse(filename)
    plain_text = xpathSApply(doc, "//td", xmlValue)
    c(first = substring(plain_text[9], 41, 50),
      second = substring(plain_text[7], 1, 550))
})


Now result is a list in which each element is a vector with the names first and second.
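
If a table is more convenient than a list, one possible follow-up is to stack the named vectors into a data frame. This is only a sketch: the result list below is a hand-made stand-in with made-up values, and result_df is a name chosen for the example.

```r
# Stand-in for the output of the lapply() call: one named vector per file.
result <- list(
  c(first = "alpha", second = "some longer text"),
  c(first = "beta",  second = "more text")
)

# Individual values can be read by position and name.
result[[1]]["first"]

# Stack the vectors row-wise into a character matrix, then convert it
# to a data frame with one row per file and columns first and second.
result_df <- as.data.frame(do.call(rbind, result), stringsAsFactors = FALSE)

result_df$first
```

This works because every element has the same names in the same order, so rbind lines the values up column by column.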

A few other remarks:

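Be careful with dots in variable names: S3 uses the dot in a name to determine which class a generic's method belongs to. Using dots in variable names to mean anything else causes confusion and should be avoided.

The gsub statement in your loop has no effect.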
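
To make the remark about dots concrete, here is a minimal sketch of S3 dispatch. The generic describe and the class text are invented for the illustration; they are not part of the XML package or of the original code.

```r
# A generic function: UseMethod looks for describe.<class of x>.
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists.
describe.default <- function(x, ...) "no specific method"

# The part after the dot names the class this method handles.
describe.text <- function(x, ...) "method for class 'text'"

note <- structure("hello", class = "text")

describe(note)  # dispatches to describe.text
describe(42)    # falls back to describe.default
```

Read this way, a variable called plain.text looks like a method of a generic plain for the class text, which is why a name such as plain_text is less ambiguous.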
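
Yes, thanks for pointing that out, but it still does not fix the error. It actually returns the same information for every cell, so it does not seem to be looping over the files. Any ideas?

@user2880936 A simple typo; it should be fixed now.

Great. Thank you, Konrad.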