Ignoring non-existent URLs when using htmlParse() in R


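Hi all,

I have a long list of place names (~15,000) that I want to use to look up wiki pages and extract data from them. Unfortunately, not all of the places have a wiki page, and when htmlParse() hits one of those pages it stops the function and returns the error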

    Error: failed to load HTTP resource

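I can't go through and remove every place name that produces a non-existent URL, so I was wondering whether there is a way to make the function skip the places that have no wiki page.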

    # Town names to be used
    towns <- data.frame('recID' = c('G62', 'G63', 'G64', 'G65'), 
                    'state' = c('Queensland', 'South_Australia', 'Victoria', 'Western_Australia'),
                    'name'  = c('Balgal Beach', 'Balhannah', 'Ballan', 'Yunderup'),
                    'feature' = c('POPL', 'POPL', 'POPL', 'POPL'))

    towns$state <- as.character(towns$state)

    towns$name <- sub(' ', '_', as.character(towns$name))

    # Function that extracts data from wiki
    wiki.tables <- function(towns)  {
      require(RJSONIO)
      require(XML)
      u <- paste('http://en.wikipedia.org/wiki/',
                 sep = '', towns[,1], ',_', towns[,2])
      res <- lapply(u, function(x) htmlParse(x))
      tabs <- lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]'),
                     readHTMLTable)
      return(tabs)
    }

    # Now to run the function. Yunderup will produce a URL that 
    # doesn't exist. So this will result in the error.
    test <- wiki.tables(towns[,c('name', 'state')])

    # It works if I don't include the place that produces a non-existent URL.
    test <- wiki.tables(towns[1:3,c('name', 'state')])

You can use the url.exists function from RCurl:

    require(RCurl)
    # u is the vector of URLs built as in wiki.tables() in the question
    sapply(u, url.exists)
    #            http://en.wikipedia.org/wiki/Balgal_Beach,_Queensland    TRUE
    #          http://en.wikipedia.org/wiki/Balhannah,_South_Australia    TRUE
    #                    http://en.wikipedia.org/wiki/Ballan,_Victoria    TRUE
    #         http://en.wikipedia.org/wiki/Yunderup,_Western_Australia    TRUE
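
Building on that check, one way to skip URLs that do not exist is to filter the URL vector before parsing. This is only a minimal sketch: parse_existing is a hypothetical helper name, and u is assumed to be the character vector of URLs built in the question. Note that it only drops URLs that url.exists() reports as missing; it does nothing about pages that exist but lack the expected table.

```r
library(RCurl)
library(XML)

# Hypothetical helper: parse only the URLs that url.exists() reports as reachable.
# u is assumed to be a character vector of URLs.
parse_existing <- function(u) {
  ok <- vapply(u, url.exists, logical(1))  # TRUE where the URL responds
  lapply(u[ok], htmlParse)                 # parse the reachable pages only
}
```

Calling parse_existing(u) in place of the lapply(u, htmlParse) step would then return one parsed document per reachable URL.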
Here is another option using the httr package. (By the way: you don't need RJSONIO.) Replace your wiki.tables(...) function with the following:

    wiki.tables <- function(towns)  {
      require(httr)
      require(XML)
      # Return the parsed page for a 200 response, NULL otherwise
      get.HTML <- function(url){
        resp <- GET(url)
        if (resp$status_code == 200)
          return(htmlParse(content(resp, as = "text"), asText = TRUE))
      }
      u <- paste('http://en.wikipedia.org/wiki/',
                 sep = '', towns[,1], ',_', towns[,2])
      res <- lapply(u, get.HTML)
      res <- res[sapply(res, function(x) !is.null(x))]   # remove NULLs
      tabs <- lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]'),
                     readHTMLTable)
      return(tabs)
    }
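
If you would rather not add httr, a similar effect can probably be had by catching the error that htmlParse() raises. The following is a rough sketch, not the answer's method: safe_parse and parse_all are hypothetical helper names, and u is assumed to be the character vector of URLs built inside wiki.tables().

```r
library(XML)

# Hypothetical helper: return the parsed document, or NULL if htmlParse() fails
# (for example with "failed to load HTTP resource").
safe_parse <- function(url) {
  tryCatch(htmlParse(url), error = function(e) NULL)
}

# Parse every URL and keep only the pages that loaded.
parse_all <- function(u) {
  res <- lapply(u, safe_parse)
  Filter(Negate(is.null), res)
}
```

The design choice here is to treat any parsing error as "no page", which is simple but also hides unrelated failures such as a dropped connection.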


Maybe an if statement in the function that tells it to move on to the next URL if the page is missing?

When I run the code, I actually get the following error: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'readHTMLTable' for signature '"XMLNodeSet"', which is not a URL-related error.

G'day Thomas, thanks, I just realised the same thing. It's because the Yunderup URL that gets created does exist, but there is no table in it matching path = '//*[@class="infobox vcard"]'. I'm trying to work out how to put in an if statement that checks whether the html data I fetch has this path... As the answer below shows, the URL does exist, but it is not the "right" type of wiki page for what I want.

Thanks jlhoward. Yes, I realised shortly afterwards that the URL does exist; it is the getNodeSet path = '//*[@class="infobox vcard"]' that doesn't. So I really have two levels of parsing to do: get rid of the places that have no URL, for which you have provided a good solution, and only extract the tables from those URLs where appropriate. I'll try to work out the second step shortly, probably using an if statement.
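
The second step discussed in the comments, checking whether a fetched page actually contains the infobox before reading it, could look something like the sketch below. infobox_tables is a hypothetical helper name, and the two in-memory pages are made-up stand-ins for real wiki pages so the example runs without network access. The readHTMLTable error most likely arises because sapply() can no longer simplify its result once one page returns an empty node set, so whole XMLNodeSet objects get passed on; that explanation is an inference, not something confirmed in the thread.

```r
library(XML)

# Hypothetical helper: return the infobox tables of a parsed page,
# or NULL when the page has no node matching the path.
infobox_tables <- function(doc, path = '//*[@class="infobox vcard"]') {
  nodes <- getNodeSet(doc, path)
  if (length(nodes) == 0) return(NULL)  # no infobox on this page: skip it
  lapply(nodes, readHTMLTable)          # one data frame per matching table
}

# Two tiny in-memory pages (made-up content) standing in for real wiki pages
with_box <- htmlParse(
  '<html><body><table class="infobox vcard">
     <tr><th>Field</th><th>Value</th></tr>
     <tr><td>State</td><td>Victoria</td></tr>
   </table></body></html>', asText = TRUE)
without_box <- htmlParse(
  '<html><body><p>No infobox here</p></body></html>', asText = TRUE)

pages <- list(Ballan = with_box, Yunderup = without_box)
tabs <- lapply(pages, infobox_tables)
tabs <- tabs[!vapply(tabs, is.null, logical(1))]  # drop pages without an infobox
```

Inside wiki.tables(), the same idea would replace the lapply(sapply(res, getNodeSet, ...), readHTMLTable) step with lapply(res, infobox_tables) followed by the NULL filter.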