How to isolate a single element from a scraped web page in R

xml r web-scraping

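I want to use R to scrape this page, and others like it, to get the goal scorers and the minutes in which they scored.

So far, this is what I have: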

require(RCurl)
require(XML)

theURL <-"http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
webpage <- getURL(theURL, header=FALSE, verbose=TRUE) 
webpagecont <- readLines(tc <- textConnection(webpage)); close(tc)  

pagetree <- htmlTreeParse(webpagecont, error=function(...){}, useInternalNodes = TRUE)

Existing questions about web scraping and XML in R are very helpful for this kind of task.

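As for your specific example, I'm not sure what you want the output to look like, but this gets the "Goals scored" line as a character vector: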
theURL <-"http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
fifa.doc <- htmlParse(theURL)
fifa <- xpathSApply(fifa.doc, "//*/div[@class='cont']", xmlValue)
goals.scored <- grep("Goals scored", fifa, value=TRUE)
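The same XPath-then-grep pattern can be tried offline on an inline HTML string, which is useful when the live page is unreachable or has changed. This is only a sketch: the HTML below is a made-up stand-in (the player names, team codes, and the 'other' div are hypothetical), and it assumes the real page puts the scorers inside a div with class 'cont', as the code above does.

```r
library(XML)

# Hypothetical stand-in for the match report page (not the real FIFA markup)
html <- paste0(
  "<html><body>",
  "<div class='cont'>Goals scored Anna ALPHA (AAA) 10', Ben BETA (BBB) 55'</div>",
  "<div class='cont'>Attendance: 1000</div>",
  "<div class='other'>Goals scored elsewhere</div>",
  "</body></html>"
)

# asText = TRUE tells htmlParse the argument is HTML content, not a file or URL
doc <- htmlParse(html, asText = TRUE)

# Text content of every div whose class attribute is exactly 'cont'
cont <- xpathSApply(doc, "//div[@class='cont']", xmlValue)

# Keep only the element that carries the "Goals scored" label
goals.line <- grep("Goals scored", cont, value = TRUE)
```

The XPath step narrows the document to the candidate divs, and grep then picks the one with the label, so the div with class 'other' is never considered even though its text also contains "Goals scored".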

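Be careful when doing things like this. In most cases organisations such as FIFA, FIBA or the NBA do not allow their data to be used. Simply put, their data is their property! So next time it would be nice to provide some pseudo-HTML, or just point to some harmless site.

I was looking for something like this a long time ago, but ended up doing it in Python! Now I can run littler scripts and populate my data sets. Cool!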

To get one entry per goal, split the string on ", " and strip the "Goals scored" label:

> gsub("Goals scored", "", strsplit(goals.scored, ", ")[[1]])
[1] "Philipp LAHM (GER) 6'"    "Paulo WANCHOPE (CRC) 12'" "Miroslav KLOSE (GER) 17'" "Miroslav KLOSE (GER) 61'" "Paulo WANCHOPE (CRC) 73'"
[6] "Torsten FRINGS (GER) 87'"
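Since the goal was to get scorers and times, the strings above could be split further into columns. The following is a minimal base R sketch, assuming every entry has the form "NAME (TEAM) MINUTE'" with a three-letter team code, as in the output shown. The names `pattern` and `goals.df` are illustrative, and anything after the minute mark (stoppage time, for instance) is ignored.

```r
# Strings copied from the output above
goals <- c("Philipp LAHM (GER) 6'", "Paulo WANCHOPE (CRC) 12'",
           "Miroslav KLOSE (GER) 17'", "Miroslav KLOSE (GER) 61'",
           "Paulo WANCHOPE (CRC) 73'", "Torsten FRINGS (GER) 87'")

# Three capture groups: player name, team code, minute
pattern <- "^\\s*(.+?)\\s+\\(([A-Z]{3})\\)\\s+(\\d+)'.*$"

# Pull each group out with a backreference; perl = TRUE enables the lazy ".+?"
player <- sub(pattern, "\\1", goals, perl = TRUE)
team   <- sub(pattern, "\\2", goals, perl = TRUE)
minute <- as.integer(sub(pattern, "\\3", goals, perl = TRUE))

# One row per goal
goals.df <- data.frame(player, team, minute, stringsAsFactors = FALSE)
```

With the data in this shape, something like `table(goals.df$player)` would give goals per player, and the same code could be applied to the result of the gsub call instead of the hard-coded vector.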