Scraping a web page, the links on the page, and forming a table with R


xml r web-scraping


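Hello, I am new to using R to get data from the Internet and, unfortunately, I know very little about HTML and XML. I am trying to scrape each story link from the parent page whose URL appears in the code below. I do not care about any other links on the parent page, but I need to create a table with one row per story URL and columns for the URL, the story title, the date (it always comes at the start of the first sentence after the story title), and the rest of the page text (which can be several paragraphs).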

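I have tried adapting code from another thread (and several related ones), but ran into trouble. Any advice or suggestions would be much appreciated. Here is what I have attempted so far (the spots where I got stuck are marked "where I ran into trouble"):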

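rm(list = ls())
library(XML)
library(plyr)
url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc

You can use xpathSApply (the XPath analogue of sapply, just as xpathApply is the analogue of lapply) to search the document for the nodes matching a given XPath expression: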
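library(XML)
url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
# Each xpathSApply call returns one vector; keep the result as dat for later steps
dat <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  story = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue),
  stringsAsFactors = FALSE)  # keep URLs and titles as character, not factor
dat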

##               dates                                                hrefs
## 1      26 June 2013             /entity/csr/don/2013_06_26/en/index.html
## 2      23 June 2013             /entity/csr/don/2013_06_23/en/index.html
## 3      22 June 2013             /entity/csr/don/2013_06_22/en/index.html
## 4      17 June 2013             /entity/csr/don/2013_06_17/en/index.html

##                                                                                    story
## 1                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 2                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 3                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 4                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update

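Can you add the expected result to your question, at least one row of your final data frame?

Great, thank you very much! Do you know how to go to each URL named in column 2 and save that page's text as a fourth column of the data frame?

@user2535366 All of the text? :) Does that make sense?

Well, probably not, but that has never stopped me :) Ultimately I want to search each of these stories for specific terms, and this seemed like a convenient way to collect them.

@user2535366 Please see my edit. A suggestion: read up on XPath, it is the way to go (nothing magic, just a bit of a learning curve). You can also see how to search for some words in your final text.

I have successfully applied the tips above to scrape several years of the stories I wanted. However, one last question (hopefully). If 2013 in the URL is replaced with 2011, 2008, or 2006, the script fails with an error. For 2011, for example: "data.frame(dates = xpathSApply(doc, "//*[@class=\"auto_archive\"]/li/a", ...: arguments imply differing number of rows: 59, 60". Looking at the raw HTML source, I see no obvious structural difference from the 2012 and 2013 URLs, where the script works. Any ideas?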
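One possible explanation for the row-count error, offered as an assumption rather than a confirmed diagnosis of the 2011 page: the three XPath queries run independently over the whole document, so a single list item with a missing link_info node, or one with an extra text node, makes the vectors differ in length. A minimal sketch of a per-item extraction that keeps the rows aligned is below. The HTML fragment and the helper one_value are made up for illustration; inline HTML is used so the sketch does not depend on the live page.

```r
library(XML)

# Hypothetical fragment mimicking the archive list; the second item
# deliberately has no link_info node.
html <- '<ul class="auto_archive">
  <li><a href="/entity/csr/don/2011_01_01/en/index.html">1 January 2011</a>
      <span class="link_info">Story A</span></li>
  <li><a href="/entity/csr/don/2011_01_02/en/index.html">2 January 2011</a></li>
</ul>'
doc <- htmlParse(html, asText = TRUE)

# One node per list item, so every column is computed item by item
items <- getNodeSet(doc, '//*[@class="auto_archive"]/li')

# Hypothetical helper: evaluate a relative XPath inside one item and
# return exactly one string (NA when nothing matches)
one_value <- function(node, path, fun, ...) {
  res <- xpathSApply(node, path, fun, ...)
  if (length(res) == 0) NA_character_
  else trimws(paste(unlist(res), collapse = " "))
}

dat <- data.frame(
  dates = sapply(items, one_value, "./a", xmlValue),
  hrefs = sapply(items, one_value, "./a", xmlGetAttr, "href"),
  story = sapply(items, one_value, './/*[@class="link_info"]', xmlValue),
  stringsAsFactors = FALSE
)
dat
```

Because every column has one entry per li, data.frame always receives vectors of equal length, and the rows with NA show which items deviate from the expected structure.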

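# Follow each story link and store the page text as a fourth column.
# dat holds the data frame produced by the data.frame() call above.
dat$text <- unlist(lapply(dat$hrefs, function(x) {
  # Turn the relative href into an absolute URL
  url.story <- gsub('/entity', 'http://www.who.int', x)
  # Text of the node with id "primary" on the story page
  texts <- xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
  # Collapse so each story yields exactly one string, even with 0 or 2+ matches
  paste(texts, collapse = ' ')
}))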
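
Once the text column exists, searching the stories for specific terms (the goal mentioned in the comments) could be sketched with grepl. The data frame rows and the search terms below are invented stand-ins, not scraped data, so the example runs without network access.

```r
# Toy stand-in for the scraped data frame; the rows are made up
dat <- data.frame(
  hrefs = c("/entity/a", "/entity/b", "/entity/c"),
  text  = c("Cases of MERS-CoV reported",
            "Cholera outbreak update",
            "Novel coronavirus infection"),
  stringsAsFactors = FALSE
)

# Hypothetical search terms, combined into one regular expression
terms   <- c("coronavirus", "MERS")
pattern <- paste(terms, collapse = "|")

# TRUE for every story whose text contains at least one term
dat$match <- grepl(pattern, dat$text, ignore.case = TRUE)

# Keep only the matching stories
hits <- dat[dat$match, ]
hits
```

If the terms contain regex metacharacters, passing fixed = TRUE to grepl and testing each term separately may be the safer choice.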