Web scraping with rvest: filtering through pagination

r web-scraping

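I'm fairly new to R and have been experimenting with the rvest package.

I'm currently trying to scrape data from a website.

My goal is to get each house's price, number of bedrooms, number of bathrooms, and square footage.

Through a Google search, I found some of Hadley's code on GitHub. It was very helpful, but after running it I noticed that it only returns one page at a time.

I'd like to see a full list of houses, with the attributes above associated with each house.

I know that when I filter through the site's pagination, it only lets me view one page at a time. This particular search has 20 pages in total.

I can see that the only thing that changes in the URL is the last part:

Original URL (page 1) =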

https://www.zillow.com/homes/for_sale/Charlotte-NC/24043_rid/globalrelevanceex_sort/35.479124,-80.39177,34.929289,-81.270676_rect/9_zm/

URL (page 2) =

https://www.zillow.com/homes/for_sale/Charlotte-NC/24043_rid/globalrelevanceex_sort/35.479124,-80.39177,34.929289,-81.270676_rect/9_zm/2_p/

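Only the

/2_p/

part changes. If you go to page 3, it becomes

/3_p/

and so on.

Is there a way to loop through all the pages, save the attributes to a data frame, and then work with that data frame?

Here is the code I'm using: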

# Inspired by https://github.com/notesofdabbler
library(rvest)
library(tidyr)

page <- read_html("https://www.zillow.com/homes/for_sale/Charlotte-NC/24043_rid/globalrelevanceex_sort/35.304479,-80.247574,35.104743,-81.414871_rect/9_zm/")

houses <- page %>%
  html_nodes(".photo-cards li article")

z_id <- houses %>% html_attr("id")

address <- houses %>%
  html_node(".zsg-photo-card-address") %>%
  html_text()

price <- houses %>%
  html_node(".zsg-photo-card-price") %>%
  html_text() %>%
  readr::parse_number()

params <- houses %>%
  html_node(".zsg-photo-card-info") %>%
  html_text() %>%
  strsplit("\u00b7")  # html_text() returns the middle-dot character itself, not the &middot; entity

# Each element of params holds three pieces: beds, baths, area (in that order)
beds <- params %>% purrr::map_chr(1) %>% readr::parse_number()
baths <- params %>% purrr::map_chr(2) %>% readr::parse_number()
house_area <- params %>% purrr::map_chr(3) %>% readr::parse_number()


df_price = data.frame(price)

df_beds = data.frame(beds)

df_baths = data.frame(baths)

df_house_area = data.frame(house_area)
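
To see what the params step does in isolation, here is a minimal sketch that runs the same split-and-parse logic on two made-up strings. The text format in `info` is an assumption about what the listing cards contain, not something copied from the site, and `parsed` is just an illustrative name.

```r
library(purrr)
library(readr)

# Hypothetical card text; the real format on the site is an assumption.
info <- c("3 bds \u00b7 2 ba \u00b7 1,897 sqft",
          "5 bds \u00b7 4 ba \u00b7 3,532 sqft")

# Split each string on the middle dot, giving a list of character vectors.
params <- strsplit(info, "\u00b7")

# Take the 1st, 2nd and 3rd piece of every element and parse the number in it.
beds       <- parse_number(map_chr(params, 1))
baths      <- parse_number(map_chr(params, 2))
house_area <- parse_number(map_chr(params, 3))

# One data frame with one row per house, instead of one data frame per column.
parsed <- data.frame(beds, baths, house_area)
```

Using the same index for all three columns would give three copies of the bedroom count, which is why the position passed to map_chr has to differ per column.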

Thanks, everyone!

We can use sprintf to build the URL for each of the 20 pages:

library(tidyverse)
links <- sprintf("https://www.zillow.com/homes/for_sale/Charlotte-NC/24043_rid/globalrelevanceex_sort/35.479124,-80.39177,34.929289,-81.270676_rect/9_zm/%d_p", 1:20)
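
The step that turns these links into a single table is not shown here, so the following is only a sketch of one way it might be done, not necessarily the code that produced the output. It wraps the extraction from the question in a helper and maps it over the URLs. The names `parse_listings` and `scrape_pages` are hypothetical, the CSS selectors are copied from the question and may not match the site's current markup, and `map_dfr` needs dplyr to be installed.

```r
library(rvest)
library(purrr)
library(readr)
library(tibble)

# Hypothetical helper: extract one row per listing from a parsed page.
# Selectors are copied from the question and may not match the live site.
parse_listings <- function(page) {
  houses <- html_nodes(page, ".photo-cards li article")

  price <- houses %>%
    html_node(".zsg-photo-card-price") %>%
    html_text() %>%
    parse_number()

  params <- houses %>%
    html_node(".zsg-photo-card-info") %>%
    html_text() %>%
    strsplit("\u00b7")

  # .default keeps a listing with missing pieces as NA instead of raising an error
  tibble(
    price      = price,
    beds       = parse_number(map_chr(params, 1, .default = NA_character_)),
    baths      = parse_number(map_chr(params, 2, .default = NA_character_)),
    house_area = parse_number(map_chr(params, 3, .default = NA_character_))
  )
}

# Hypothetical driver: read each URL, parse it, and stack the results.
# .id = "page_no" records the position of each URL in `urls` as a character column.
scrape_pages <- function(urls) {
  map_dfr(urls, function(u) parse_listings(read_html(u)), .id = "page_no")
}

# res <- scrape_pages(links)
```

Keeping the parsing separate from the download means the helper can be tried on a saved HTML file before any requests are sent to the site.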

- output

res
# A tibble: 500 x 5
#   page_no   price  beds baths house_area
#   <chr>     <dbl> <dbl> <dbl>      <dbl>
# 1 1       1995000  5.00  7.00       8110
# 2 1        325000  3.00  2.00       1897
# 3 1       1099000  5.00  4.00       3532
# 4 1        550990  4.00  4.00       2953
# 5 1        323000  5.00  3.00       3100
# 6 1        315000  3.00  3.00       1723
# 7 1       2600000  5.00  7.00       7124
# 8 1       1300000  5.00  5.00       4737
# 9 1        549900  2.00  2.00       1788
#10 1        538000  5.00  4.00       3595
# ... with 490 more rows

- The extracted information was checked against the first few listings on page 1.

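FYI - I know there is one, @JasonAizkalns, but I'd like to use this package to build on what I've recently learned. After this, I want to scrape other kinds of web pages, and I'm not sure Rzillow would let me do that.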
