Web scraping across multiple links with R

html r web-scraping

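I am trying to build a tidy data frame of news releases from several websites. Most of the sites are structured the same way: a main page lists each headline with a short lede, followed by a link to the full article. I would like to scrape all of the full articles starting from the main page. Here is my approach. Any help would be much appreciated.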

library(tidyverse)
library(rvest)
library(xml2)

url_1 <- read_html("http://lifepointhealth.net/news")


## seems to grab the lists
url_1 %>% 
  html_nodes("li") %>% 
  html_text() %>% 
  str_squish() %>% 
  str_trim() %>% 
  enframe()

# A tibble: 80 x 2
    name value                      
   <int> <chr>                      
 1     1 Who We Are Our Company Mis…
 2     2 Our Company Mission, Visio…
 3     3 Mission, Vision, Values an…
 4     4 Giving Quality a Voice     
 5     5 How We Operate             
 6     6 Leadership                 
 7     7 Awards                     
 8     8 20th Anniversary           
 9     9 Our Communities Explore Ou…
10    10 Explore Our Communities    
# … with 70 more rows


# this grabs the titles but there should be many more
url_1 %>% 
  html_nodes("li .title") %>% 
  html_text() %>% 
  str_squish() %>% 
  str_trim() %>% 
  enframe() 

# A tibble: 20 x 2
    name value                      
   <int> <chr>                      
 1     1 LifePoint Health Names Elm…
 2     2 David Steitz Named Chief E…
 3     3 LifePoint Health Receives …
 4     4 Thousands of Top U.S. Hosp…
 5     5 Conemaugh Nason Medical Ce…
 6     6 Vicki Parks Named CEO of W…
 7     7 LifePoint Health Honors Ka…
 8     8 Ennis Regional Medical Cen…
 9     9 LifePoint Health Business …
10    10 LifePoint Health and R1 RC…

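If you watch the Network tab of your browser's developer tools, you will see that the page sends a POST request to

http://lifepointhealth.net/api/posts

each time "Load More" is clicked. This is why scraping the static HTML only returns the first 20 titles. By mimicking that request as shown below, you can get the details of all 332 articles in one call: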

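library(magrittr)  # provides %>%

# Replay the request the page sends when "Load More" is clicked,
# asking for all 332 posts at once (skip = 0, take = 332).
items <- httr::POST(
  "http://lifepointhealth.net/api/posts",
  config = httr::add_headers(`Content-Type` = "application/x-www-form-urlencoded"),
  # A character body is sent as-is, so no `encode` argument is needed
  body = "skip=0&take=332&Type=News&tagFilter="
) %>% 
  httr::content() %>%
  .$Items

# Drop NULL fields from each post, then stack the posts into one data frame
items <- dplyr::bind_rows(lapply(items, function(f) {
  as.data.frame(Filter(Negate(is.null), f))
}))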
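
The final step, turning a list of parsed records into a data frame, can be tried offline. The sketch below uses made-up records. The field names `Title`, `Url`, and `Summary` and the helper `flatten_posts()` are assumptions for illustration only and are not taken from the real API response. It shows why the `NULL` fields are filtered out first: `as.data.frame()` fails on a list that contains a `NULL` element, while `bind_rows()` fills columns missing from a record with `NA`.

```r
library(dplyr)

# Hypothetical records shaped like parsed JSON posts; the field names are
# illustrative and are not taken from the real API response.
posts <- list(
  list(Title = "First post",  Url = "/news/first",  Summary = NULL),
  list(Title = "Second post", Url = "/news/second", Summary = "A short lede")
)

# Remove NULL fields so each record converts to a one-row data frame,
# then stack the rows; bind_rows() fills columns a record lacks with NA.
flatten_posts <- function(records) {
  bind_rows(lapply(records, function(rec) {
    as.data.frame(Filter(Negate(is.null), rec), stringsAsFactors = FALSE)
  }))
}

posts_df <- flatten_posts(posts)
print(posts_df)
```

Any column that the real response provides for article links could then be passed to `read_html()` one row at a time to fetch the full articles, although the actual field name would need to be checked against the response.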

This works great, thanks.