Scraping Yellow Pages in R

r web-scraping

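I'm trying to scrape a list of plumbers to build a tibble.

The code works fine for each piece (name, phone number, email), but when I put it together in a function that builds the tibble, it fails because some listings have no phone number or email.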

url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"

testscrape <- function(){
  webpage <- read_html(url)
  
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  
  return(tibble(docname = docname, ph_no = ph_no, email = email))
}

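Running this leaves it hanging.

I know there are fewer phone numbers than plumbers listed, so how can I return NA for a plumber with no phone number, so that each number lines up with the correct plumber?

Thanks in advance.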

You can subset the extracted data by index; wherever a value is missing, the subset returns NA.

library(rvest)
library(stringr)
library(tibble)

testscrape <- function(url){
  webpage <- read_html(url)
  
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  
  # Index every vector up to the longest length; positions past the end of a
  # shorter vector come back as NA, so all columns end up the same size.
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)

# docname ph_no email
#  <lgl>   <lgl> <lgl>
#1   NA      NA    NA
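
To make the indexing trick concrete, here is a minimal base R sketch. The names and phone numbers are made-up illustration data, not values from the site. It shows that indexing past the end of a vector yields NA, and also that the NA values are appended at the end, so this alone does not guarantee that each number sits next to the right plumber.

```r
# Hypothetical example data: three names but only two phone numbers.
docname <- c("A Plumbing", "B Plumbing", "C Plumbing")
ph_no   <- c("0400 000 001", "0400 000 002")

# Index both vectors up to the longest length.
n <- seq_len(max(length(docname), length(ph_no)))
padded <- ph_no[n]  # the third position is out of range, so it becomes NA

# Taking the first element of an empty vector also gives NA.
first_or_na <- character(0)[1]

print(data.frame(docname = docname[n], ph_no = padded))
```

Note that if it is the second plumber who has no phone number, the padded vector would still put the NA in the third row, so the columns have equal length but may be misaligned.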

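Because I copy-pasted the wrong thing. Sorry. Can you try the updated answer?

Can you provide a few links so I can test my answer? By the way, the url you shared returns docname, ph_no and email as NA. Is that right? I have also updated the answer with another change that may help.

Even for this link, docname, ph_no and email are all empty. Are you sure your code works? In your question you say the code works fine for each piece, but I can't reproduce that with any example you shared, so I don't know what exactly needs debugging.

Let's take a look.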
Error: Tibble columns must have compatible sizes.
* Size 36: Existing data.
* Size 17: Column `ph_no`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Browse[1]> 
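
The size mismatch in this error (36 names against 17 phone numbers) comes from running each selector over the whole page. One possible way to keep rows aligned, sketched here under assumptions, is to select one node per listing first and then look up each field inside that node. rvest's `html_element()` returns exactly one result per input node, with a missing node where nothing matches, and `html_text()` / `html_attr()` turn those into NA. The `.listing` container class and the inline HTML below are hypothetical stand-ins; the real page's container selector would need to be checked in the browser.

```r
library(rvest)

# Hypothetical markup: ".listing" is an assumed container class,
# not necessarily what the real site uses.
page <- minimal_html('
  <div class="listing">
    <a class="listing-name">A Plumbing</a>
    <span class="contact-phone"><span class="contact-text">0400 000 001</span></span>
  </div>
  <div class="listing">
    <a class="listing-name">B Plumbing</a>
  </div>
  <div class="listing">
    <a class="listing-name">C Plumbing</a>
    <span class="contact-phone"><span class="contact-text">0400 000 003</span></span>
    <a class="contact-email" href="mailto:c%40example.com?subject=hi">Email</a>
  </div>')

# One node per listing; every field lookup below is relative to these nodes.
listings <- html_elements(page, ".listing")

# html_element() (singular) keeps one entry per listing, NA where absent.
email_href <- html_attr(html_element(listings, ".contact-email"), "href")
email <- sub("^mailto:", "", email_href)      # drop the scheme
email <- sub("\\?.*$", "", email)             # drop any query string
email <- gsub("%40", "@", email, fixed = TRUE)

result <- data.frame(
  docname = html_text(html_element(listings, ".listing-name")),
  ph_no   = html_text(html_element(listings, ".contact-phone .contact-text")),
  email   = email,
  stringsAsFactors = FALSE
)
print(result)
```

Because every column is derived from the same `listings` node set, all columns have one entry per listing, and `tibble::as_tibble(result)` would convert the result without a size error.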
