R web scraping MovieMeter
I am trying to scrape movie titles, ratings and years from MovieMeter so that I can compare them with IMDb. I managed to get the IMDb top 250 into a data frame with title, rating, rank and year, but I can't seem to get MovieMeter working. Here is my code:
url <- rvest::html("https://www.moviemeter.nl/list/")
scrapemoviemeter <- rvest::html_nodes(x = url, css = ".film_row")
head(scrapemoviemeter)
moviemeter <- rvest::html_text(scrapemoviemeter, trim = TRUE)
How can I get the data into a data frame with separate columns for rating, title and year?
I think this is easier with XPath. Try this:
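library(rvest)
library(stringi)
# read_html() replaces html(), which was deprecated in later rvest releases
url <- rvest::read_html("https://www.moviemeter.nl/list/")
# Scores: <span class="score"> elements inside the list container
scores <- rvest::html_nodes(x = url, xpath = "/html/body/div[1]/div[4]/div/div[3]/*//span[@class='score']")
scores <- rvest::html_text(scores, trim = TRUE)
# Titles: <a class="tooltip"> links inside the list container
names <- rvest::html_nodes(x = url, xpath = "/html/body/div[1]/div[4]/div/div[3]/*//a[@class='tooltip']")
names <- rvest::html_text(names, trim = TRUE)
# Years: bare text nodes of each film_row; keep only the four-digit number
years <- rvest::html_nodes(x = url, xpath = "/html/body/div[1]/div[4]/div/div[3]//div[@class='film_row']/text()")
years <- rvest::html_text(years, trim = TRUE)
years <- stri_extract(years, regex = "\\b\\d{4}\\b")
years <- years[!is.na(years)] # drop text nodes that contained no year
names <- unlist(names)
scores <- unlist(scores)
years <- unlist(years)
# Bind the three vectors column-wise; all columns end up as text
df <- cbind(names, scores, years)
df <- as.data.frame(df)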
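One thing to watch with the approach above is that the three vectors are collected independently, so a row without a year would shift everything after it. A minimal sketch of a per-row alternative follows. The HTML snippet is hypothetical markup that only borrows the class names used above (film_row, tooltip, score), and the assumption that the year sits in parentheses is mine; the real page layout may differ.

```r
library(rvest)
library(stringi)

# Hypothetical markup standing in for the MovieMeter list page (not real data)
html <- '
<div class="film_row"><a class="tooltip" href="/film/1">First Film</a> (1994) <span class="score">4.25</span></div>
<div class="film_row"><a class="tooltip" href="/film/2">Second Film</a> (1972) <span class="score">4.10</span></div>
<div class="film_row"><a class="tooltip" href="/film/3">Third Film</a> <span class="score">3.90</span></div>
'
page <- read_html(html)
rows <- html_nodes(page, ".film_row")

# html_node() (singular) returns one result per row, NA where nothing matches,
# so the columns stay aligned even when a field is missing
title <- html_text(html_node(rows, "a.tooltip"), trim = TRUE)
score <- html_text(html_node(rows, "span.score"), trim = TRUE)

# Assumed format: the year appears in parentheses somewhere in the row text
row_text <- html_text(rows, trim = TRUE)
year <- stri_match_first_regex(row_text, "\\((\\d{4})\\)")[, 2]

df_rows <- data.frame(
  title = title,
  score = as.numeric(score),
  year = as.integer(year),
  stringsAsFactors = FALSE
)
print(df_rows)
```

The third row has no year, so it gets NA in that column instead of silently taking the year of another film.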
If you have IMDb IDs, use the MovieMeter API rather than scraping:
library(moviemeter) # devtools::install_github("hrbrmstr/moviemeter")
library(purrr)
imdb_ids <- c("tt1107846", "tt0282552", "tt0048199")
map_df(imdb_ids, function(x) {
mm <- mm_get_movie_info(x)
mm <- map(mm, ~. %||% NA) # the JSON response has nulls, so replace them with NA
mm[c(1:11)] # remove posters, countries, genres, actors and directors
}) -> df
dplyr::glimpse(df)
## Observations: 3
## Variables: 11
## $ id <int> 57161, 6465, 33351
## $ url <chr> "https://www.moviemeter.nl/film/57161", "https://www.moviemeter.nl/film/6465", "https://www.moviemeter.nl/film/33351"
## $ year <int> 2007, 2002, 1955
## $ imdb <chr> "tt1107846", "tt0282552", "tt0048199"
## $ title <chr> "Theft", "Riders", "Illegal"
## $ display_title <chr> "Theft", "Riders", "Illegal"
## $ alternative_title <chr> NA, "Steal", NA
## $ plot <chr> "Een naïeve dorpsjongen wordt verliefd op een crimineel. Guy was altijd een nette beschaafde jongen, wie had er ooi...
## $ duration <int> 90, 83, 88
## $ votes_count <int> 1, 293, 20
## $ average <dbl> 2.00, 2.55, 3.42
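Since the goal in the question is a comparison with IMDb, here is a rough sketch of how the two data frames might be joined. mm_df reuses the title, year and average values from the output above. imdb_df is a hypothetical stand-in for the asker's IMDb data frame, and its ratings and ranks are placeholders I made up, not real IMDb figures. Matching on title plus year is also an assumption; with the API result, joining on the imdb ID column would be more reliable if the IMDb data frame has one.

```r
# MovieMeter side: values copied from the glimpse() output above
mm_df <- data.frame(
  title = c("Theft", "Riders", "Illegal"),
  year = c(2007L, 2002L, 1955L),
  average = c(2.00, 2.55, 3.42),
  stringsAsFactors = FALSE
)

# Hypothetical IMDb side: rating and rank are placeholder values, not real data
imdb_df <- data.frame(
  title = c("Riders", "Illegal", "Some Other Film"),
  year = c(2002L, 1955L, 1999L),
  rating = c(5.0, 6.0, 7.0),
  rank = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Left join: keep every MovieMeter row, fill NA where IMDb has no match
compared <- merge(mm_df, imdb_df, by = c("title", "year"), all.x = TRUE)
print(compared)
```

Films that only appear on one side show up with NA in the other site's columns, which makes gaps easy to spot before comparing the ratings.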
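Scraping IMDb violates its terms of service, so if you obtained the IMDb IDs by scraping IMDb, you are violating those terms. MovieMeter has an API, and there is an R package for working with it. They also ask for citation/attribution when you publish anything derived from their data.
@indian friends: rather than suggesting an edit to the answer that deletes all of its text, delete the question.
What happened to this question? It went from IMDb + MovieMeter and R to energy drinks and Python. It should be deleted. The OP does know the edit history is fully available, right?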