read_html problem when looping over and reading from a data frame (R, rvest)


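I have a data frame with 100 rows of URLs that I want to extract data from. Reading a single URL, copied from one of the rows and saved in a variable "url", is done like this: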

webpage <- read_html(url, encoding = "windows-874")

The function that reads the URL:

library(rvest)

getBatchAnalystData <- function(URLRow){
  # URL stored in column 1 of the given row
  url <- URLSetTrade[URLRow,1]
  webpage <- read_html(url, encoding = "windows-874")
  # Text of every <td> cell, with the first "-" in each removed
  target_price_html <- html_nodes(webpage,'td')
  target_price_data <- html_text(target_price_html)
  sub("-","",target_price_data)
}

The loop that calls the function above:

for (i in 1:nrow(URLSetTrade)) {
  assign(paste0(EarningDate2019_07_28_Cleaned[i,1], "AnalystData"),
         getBatchAnalystData(i))
}

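As far as I know, xml2::read_html requires its input x to be a single URL, not a vector of two or more URLs. One way to read several URLs with a single function call is to use a function such as purrr::map. In our case, purrr::map takes a list and the function xml2::read_html as input and returns the result of applying the function to each element of the input list. If you have not used purrr before, you may need to install it from CRAN.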

mylist <- list("http://nytimes.com", "http://economist.com")
purrr::map(mylist, xml2::read_html)

Here mylist should be the list of URLs you actually want, rather than the ones I pasted above. I hope this helps.
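As a rough sketch of how this might look when the sources sit in a data frame column rather than a hand-written list: the data frame "pages" and its column "src" below are hypothetical stand-ins, and literal HTML strings are used in place of real URLs so that the example runs offline. The assumption is that the column holds plain character strings, one per row.

```r
library(xml2)
library(purrr)

# Hypothetical stand-in for the URL data frame. Literal HTML strings replace
# real URLs here so the sketch does not need network access.
pages <- data.frame(
  src = c("<html><body><table><tr><td>1-0</td></tr></table></body></html>",
          "<html><body><table><tr><td>2-5</td></tr></table></body></html>"),
  stringsAsFactors = FALSE
)

# pages[[1]] extracts the first column as a plain vector, one element per row;
# as.character() guards against the column having been stored as a factor.
sources <- as.character(pages[[1]])

# map() calls read_html() once per element and returns a list of documents.
htmls <- map(sources, read_html)
```

Each element of htmls is then a single parsed document, which is the kind of input read_html produces when it is called on one URL at a time.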

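Do your rows 1 and 2 contain the same URL? Are they supposed to be different URLs?

There is a subtle difference between the two rows: the first row has txtSymbol=wha and the second row has txtSymbol=PRM.

Thank you very much, Fred. Your solution works great!! However, now that read_html is working, the error has moved to the next line. html_nodes(webpage, 'td') returns an error: Error in UseMethod("xml_find_all"): no applicable method for 'xml_find_all' applied to an object of class "list". I find this strange, because when I run the code on webpage without the loop it works fine.

The output of purrr::map is a list, so you need to treat it as one. Before going further, can you run html_nodes(mylist[[1]], "td") successfully? If so, you should be able to replace the line that uses html_nodes with a call to purrr::map, something like purrr::map(htmls, html_nodes). What is the name of the argument that is set to 'td' in your code? You will need to add that argument, together with its name, to the call to purrr::map.

Oh, I should have said that htmls in the code above is the output of purrr::map(mylist, xml2::read_html).

Problem solved. Thanks again very much for your help.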

purrr::map(mylist, xml2::read_html, encoding = "windows-874")
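The remaining steps discussed in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: literal HTML fragments stand in for the downloaded pages so the code runs offline, the names htmls, cells and cleaned are made up for the example, and css is assumed to be the name of the html_nodes argument that the question sets to 'td'.

```r
library(xml2)
library(rvest)
library(purrr)

# Literal HTML fragments stand in for downloaded pages (assumption, so the
# sketch runs without network access).
htmls <- map(
  c("<table><tr><td>10-5</td><td>-</td></tr></table>",
    "<table><tr><td>7</td></tr></table>"),
  read_html
)

# htmls is a list, so html_nodes() has to be mapped over its elements.
# The named argument css = "td" is passed through map() to each call.
cells <- map(htmls, html_nodes, css = "td")

# Same post-processing as in the question: extract the cell text and
# remove the first "-" from each string.
cleaned <- map(cells, function(nodes) sub("-", "", html_text(nodes)))
```

The result is a list with one character vector per page, which can be indexed with cleaned[[i]] instead of creating one variable per row with assign.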