R: web scraping multiple levels of a website

r web-scraping

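I am looking to scrape a website and then, for each scraped item, scrape more information from its sub-page. As an example, I will use the IMDB website. I am using the rvest package and Google Chrome to pick out CSS selectors.

From the IMDB site, I can get information as follows: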

library('rvest')

# URL to be scraped
url <- 'http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2'

# Read the HTML code from the website
webpage <- read_html(url)

# Use CSS selectors to select the title links
movies_html <- html_nodes(webpage, '.titleColumn a')

# Convert the TV show data to text
movies <- html_text(movies_html)

head(movies)
[1] "Planet Earth II"  "Band of Brothers" "Planet Earth"     "Game of Thrones"  "Breaking Bad"     "The Wire"

I am sure this is simple, but it is new to me and I do not know the right terms to search for to find a solution.

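If I understand correctly, you are looking for a way to:

1. Identify the URLs of the top 250 TV show pages (main_url)
2. Get the titles of the top 250 shows (m_titles)
3. Visit those URLs (m_urls)
4. Extract the cast of those TV shows (m_cast)

Is that right?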
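The title and URL extraction described above can be tried offline before touching the live site. The following is only a sketch: the HTML fragment, the `tt0000001`-style paths, and the variable names `html`, `links`, `titles`, `hrefs`, and `urls` are made up for illustration, and it assumes the page marks each title link up as `.titleColumn a`, as in the question.

library(rvest)

# Made-up fragment imitating the markup the '.titleColumn a' selector expects
html <- '<table>
  <tr><td class="titleColumn"><a href="/title/tt0000001/">Show One</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000002/">Show Two</a></td></tr>
</table>'

page  <- read_html(html)
links <- html_nodes(page, ".titleColumn a")

# Each link node carries both the visible title and the relative URL
titles <- html_text(links)
hrefs  <- html_attr(links, "href")

# Relative paths need the site prefix before they can be visited
urls <- paste0("http://www.imdb.com", hrefs)

print(data.frame(title = titles, url = urls))

The point of the sketch is that one set of nodes yields both pieces of information, so titles and URLs stay aligned by position.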

First, we will define a function that extracts the cast from a TV show page:

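getcast <- function(url){
  # Read the show page and select the cast entries
  page <- read_html(url)
  nodes <- html_nodes(page, '#titleCast .itemprop')
  cast <- html_text(nodes)

  # Keep every second element (even positions); unlike seq(from=2, ...),
  # this also works when the page has no cast entries
  cast <- cast[seq_along(cast) %% 2 == 0]
  return(cast)
}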
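To see what keeping every second element does, here is a sketch on a made-up fragment. It assumes the cast table nests one `.itemprop` element inside another, so the selector returns two nodes per cast member. That structure is an assumption chosen for illustration, not something confirmed by the thread. `cast_from_page` is a hypothetical helper that repeats the extraction on an already parsed page so that no network access is needed.

library(rvest)

# Hypothetical helper: same extraction as getcast(), but on a parsed page
cast_from_page <- function(page) {
  nodes <- html_nodes(page, "#titleCast .itemprop")
  cast  <- html_text(nodes)
  cast[seq_along(cast) %% 2 == 0]  # keep even positions only
}

# Made-up fragment: each cast member matches the selector twice (td and span)
html <- '<table id="titleCast">
  <tr><td class="itemprop"><a href="/name/nm0000001/"><span class="itemprop">Actor A</span></a></td></tr>
  <tr><td class="itemprop"><a href="/name/nm0000002/"><span class="itemprop">Actor B</span></a></td></tr>
</table>'

demo_cast  <- cast_from_page(read_html(html))
empty_cast <- cast_from_page(read_html("<p>no cast table here</p>"))

print(demo_cast)
print(length(empty_cast))

If the real page does not duplicate the entries this way, the position filter would drop half of the names, so it is worth inspecting `html_text(nodes)` on one page first.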

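The getcast answer has a small error: read_html(url) in the main script should be read_html(main_url). I could not fix it myself because edits of fewer than 6 characters are not allowed. But this is exactly what I was looking for, thanks!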
# Open main_url and navigate to the interesting part of the page:
main_url <- "http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2"

main_page <- read_html(main_url)
movies_html <- html_nodes(main_page, '.titleColumn a')

# From the interesting part, get the titles and URLs:
m_titles <- html_text(movies_html)

sub_urls <- html_attr(movies_html, 'href')
m_urls <- paste0('http://www.imdb.com', sub_urls)

# Use `getcast()` to extract the cast from every URL in `m_urls`
m_cast <- lapply(m_urls, getcast)
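Since this loop visits 250 pages, it may help to pause between requests, survive a single failed page, and keep the results tied to their titles. The following is a minimal sketch, not part of the original answer: `scrape_all`, `fake_getcast`, and the `demo_*` values are hypothetical names, and `fake_getcast` stands in for `getcast()` so the example runs without network access.

# Hypothetical wrapper: call `fetch` on each URL, pause between requests,
# and return NA instead of stopping when a single page fails
scrape_all <- function(urls, fetch, delay = 1) {
  lapply(urls, function(u) {
    Sys.sleep(delay)  # wait `delay` seconds before each request
    tryCatch(fetch(u), error = function(e) NA_character_)
  })
}

# Stand-in for getcast() so the sketch runs offline
fake_getcast <- function(url) {
  if (grepl("broken", url)) stop("page could not be read")
  c("Actor A", "Actor B")
}

demo_urls   <- c("http://example.com/title/1/", "http://example.com/broken/")
demo_titles <- c("Show One", "Show Two")

demo_cast <- scrape_all(demo_urls, fake_getcast, delay = 0)
names(demo_cast) <- demo_titles

# Long format: one row per (title, cast member) pair
cast_df <- data.frame(
  title = rep(names(demo_cast), lengths(demo_cast)),
  cast  = unlist(demo_cast, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(cast_df)

With the real data, the equivalent call would be along the lines of scrape_all(m_urls, getcast) followed by names(m_cast) <- m_titles.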