R: scraping multiple levels of a website

Tags: r, web-scraping, rvest

I am looking to scrape a website. Then, for each item scraped, I want to collect more information from a sub-page. As an example I will use the IMDB site. I am using the rvest package and a CSS selector tool in Google Chrome.

From the IMDB site, I can get information like this:

library('rvest')

# URL to be scraped
url <- 'http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2'

# Read the HTML code from the website
webpage <- read_html(url)

# Use CSS selectors to pick out the title links
movies_html <- html_nodes(webpage, '.titleColumn a')

# Convert the TV show data to text
movies <- html_text(movies_html)

head(movies)
[1] "Planet Earth II"  "Band of Brothers" "Planet Earth"     "Game of Thrones"  "Breaking Bad"     "The Wire"

I'm sure this is simple, but it is new to me and I don't know which terms to search for to find a solution.

If I understand you correctly, you are looking for a way to:

  • identify the URL of the top-250 page (main_url)
  • get the titles of the top 250 shows (m_titles)
  • visit those show pages (m_urls)
  • extract the cast of those TV shows (m_cast)

Is that right?

    First, we define a function that extracts the cast from a TV show's page:

    getcast <- function(url){
      page <- read_html(url)
      nodes <- html_nodes(page, '#titleCast .itemprop')
      cast <- html_text(nodes)

      # The selector matches nodes in pairs; keep every second entry,
      # which holds the actor names.
      inds <- seq(from=2, to=length(cast), by=2)
      cast <- cast[inds]
      return(cast)
    }
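A quick offline sketch of why the function keeps only every second element: the `#titleCast .itemprop` selector appears to match two nodes per cast row, with the actor name in the even positions. Using a mock vector standing in for the output of `html_text()` (the names here are purely illustrative):

```r
# Mock of what html_text(nodes) might return: entries alternate, with
# the actor name in every second position (assumption for illustration).
cast <- c("photo1", "Actor One", "photo2", "Actor Two", "photo3", "Actor Three")

# Same indexing as in getcast(): keep positions 2, 4, 6, ...
inds <- seq(from = 2, to = length(cast), by = 2)
cast[inds]
# [1] "Actor One"   "Actor Two"   "Actor Three"
```

If IMDB changes its markup, the pairing assumption breaks, so it is worth checking `length(cast)` and a few raw entries before trusting the even-index trick.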
    

    There is one small error in the code: read_html(url) should be
    read_html(main_url) (the site won't let me submit an edit of fewer
    than 6 characters). But this is exactly what I was looking for, thanks!
    getcast <- function(url){
      page <- read_html(url)
      nodes <- html_nodes(page, '#titleCast .itemprop')
      cast <- html_text(nodes)

      # The selector matches nodes in pairs; keep every second entry,
      # which holds the actor names.
      inds <- seq(from=2, to=length(cast), by=2)
      cast <- cast[inds]
      return(cast)
    }
    
    # Open main_url and navigate to interesting part of the page:
    main_url <- "http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2"
    
    main_page <- read_html(main_url)
    movies_html <- html_nodes(main_page, '.titleColumn a')
    
    # From the interesting part, get the titles and URLs:
    m_titles <- html_text(movies_html)
    
    sub_urls <- html_attr(movies_html, 'href')
    m_urls <- paste0('http://www.imdb.com', sub_urls)
    
    # Use `getcast()` to extract movie cast from every URL in `m_urls`
    m_cast <- lapply(m_urls, getcast)
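As a possible follow-up (a sketch, assuming the `getcast()`, `m_cast`, `m_titles`, and `m_urls` objects from the code above): `lapply()` returns an unnamed list, so naming its elements by show title makes lookups easier, and pausing between the 250 requests is kinder to IMDB's servers. The `getcast_polite` wrapper is a hypothetical helper, not part of the original answer:

```r
# Name each cast vector after its show (assumes m_cast and m_titles
# from the code above, in the same order).
names(m_cast) <- m_titles
m_cast[["Breaking Bad"]]   # cast of one show, looked up by title

# Optional: rate-limit the scraping by sleeping between requests.
getcast_polite <- function(url) {
  Sys.sleep(1)   # one-second pause before each page request
  getcast(url)
}
m_cast <- lapply(m_urls, getcast_polite)
```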