
Scraping Yellow Pages in R


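I'm trying to scrape a list of plumbers to build a tibble.

The code works fine for each part on its own (name, phone number, email), but when I put it all in one function to build the tibble, it errors out because some listings have no phone number or email.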

url <- "https://www.yellowpages.com.au/search/listings?clue=plumbers&locationClue=Greater+Sydney%2C+NSW&lat=&lon=&selectedViewMode=list"

testscrape <- function(){
  webpage <- read_html(url)
  
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  
  return(tibble(docname = docname, ph_no = ph_no, email = email))
}
Running this fails with an error (the full message is reproduced further down).

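I know there are fewer phone numbers than plumbers listed, so how do I return NA for a plumber with no phone number, so that each number lines up with the plumber it belongs to?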


Thanks in advance.

You can subset the extracted data to get the first value, which gives NA when the value is empty.

library(rvest)
library(stringr)
library(tibble)

testscrape <- function(url){
  webpage <- read_html(url)
  
  # Listing names
  docname <- webpage %>%
    html_nodes(".left .listing-name") %>%
    html_text()
  
  # Phone numbers (may be fewer than the number of listings)
  ph_no <- webpage %>%
    html_nodes(".contact-phone .contact-text") %>%
    html_text()
  
  # Email addresses, cleaned up from the mailto: links
  email <- webpage %>%
    html_nodes(".contact-email") %>%
    html_attr("href") %>%
    as.character() %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40","@")
  
  # Indexing past the end of a vector returns NA, so every column
  # ends up with the length of the longest one
  n <- seq_len(max(length(docname), length(ph_no), length(email)))
  tibble(docname = docname[n], ph_no = ph_no[n], email = email[n])
}
testscrape(url)

# docname ph_no email
#  <lgl>   <lgl> <lgl>
#1   NA      NA    NA   
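
Padding every column to the same length avoids the error, but the NA values are appended at the end, so a phone number can still end up next to the wrong plumber. One way to keep the rows aligned, sketched here under assumptions, is to select each listing's container first and then call html_node() (singular) on that set, which returns one result per listing and NA where nothing matches. The ".listing" container class and the inline HTML below are hypothetical stand-ins, not the real page's markup; the actual container selector would have to be checked in the browser.

```r
library(rvest)
library(stringr)
library(tibble)

# Hypothetical markup standing in for a results page. The ".listing"
# container class is an assumption, not taken from the real site.
page <- minimal_html('
  <div class="listing">
    <a class="listing-name">Alpha Plumbing</a>
    <a class="contact-phone"><span class="contact-text">(02) 0000 0001</span></a>
    <a class="contact-email" href="mailto:alpha%40example.com?subject=Enquiry">Email</a>
  </div>
  <div class="listing">
    <a class="listing-name">Beta Plumbing</a>
  </div>
  <div class="listing">
    <a class="listing-name">Gamma Plumbing</a>
    <a class="contact-phone"><span class="contact-text">(02) 0000 0003</span></a>
  </div>
')

# One node per listing
listings <- html_nodes(page, ".listing")

# html_node() returns exactly one result per listing, and a missing
# node where there is no match, so each column has one value per row
result <- tibble(
  docname = listings %>% html_node(".listing-name") %>% html_text(),
  ph_no   = listings %>% html_node(".contact-phone .contact-text") %>% html_text(),
  email   = listings %>% html_node(".contact-email") %>% html_attr("href") %>%
    str_remove_all(".*:") %>%
    str_remove_all("\\?(.*)") %>%
    str_replace_all("%40", "@")
)

print(result)
```

In this sketch the second listing gets NA for both phone and email, and the third gets NA for email only, while the names stay on their own rows.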

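Because I copy-pasted the wrong thing. Sorry. Can you try the updated answer?

Can you provide a few links so I can test my answer? By the way, the url you shared returns NA for docname, ph_no and email. Is that right? I have also updated the answer with another modification that might help.

Even for this link, docname, ph_no and email are all empty. Are you sure your code works? In your question you say the code works fine for each part, but I can't see that with any of the examples you shared, so I don't know what exactly needs debugging.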
Error: Tibble columns must have compatible sizes.
* Size 36: Existing data.
* Size 17: Column `ph_no`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
Browse[1]> 
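
The size mismatch in that message can be reproduced without any scraping. The following is a minimal sketch with made-up vectors (the names and numbers are placeholders, not data from the page): tibble() rejects columns of different lengths, while indexing each vector with a common index pads the shorter one with NA.

```r
library(tibble)

# Made-up data: three names but only two phone numbers
docname <- c("A", "B", "C")
ph_no   <- c("111", "333")

# tibble() only recycles vectors of length 1, so this raises an error
mismatch <- tryCatch(
  tibble(docname = docname, ph_no = ph_no),
  error = function(e) "error"
)

# Indexing past the end of a vector yields NA, which pads ph_no to length 3
n <- seq_len(max(length(docname), length(ph_no)))
padded <- tibble(docname = docname[n], ph_no = ph_no[n])

print(padded)
```

Note that the NA lands in the last row regardless of which entry actually lacked a number, so this removes the error but does not by itself match numbers to the right names.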