Accessing an HTML table with rvest
So I want to scrape some NBA data. Here is what I have so far, and it works fine:
install.packages('rvest')
library(rvest)
url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)
away = data[[1]]
home = data[[3]]
colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]
away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]
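The problem is that these tables don't include the team names, which are important. To get that information, I thought I would scrape the four factors table on the page; however, rvest doesn't seem to recognize it as a table. The div containing the four factors table is: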
<div class="overthrow table_container" id="div_four_factors">
<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">
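But this doesn't seem to work, since all I get is an empty list. How can I access the four factors table?

I'm by no means an HTML expert, but the table you're interested in appears to be commented out in the page source, and the comment is then overwritten at some point before rendering. That would explain why html_nodes(webpage, 'table') never sees it.

If we assume the home team is always listed second, we can use positional arguments and scrape another element on the page instead: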
table = html_nodes(webpage,'#bottom_nav_container')
teams <- html_text(table[1]) %>%
stringr::str_split("Schedule\n")
away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])
Obviously, this isn't the cleanest solution, but such is life in the world of web scraping.
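
If the four factors table really is wrapped in an HTML comment, as suggested above, one common way to reach it is to select the comment nodes, re-parse their text as HTML, and then call html_table on the result. The following is only a minimal sketch under that assumption: the helper name extract_commented_tables is hypothetical, and the inline sample HTML uses made-up placeholder values rather than anything from the real page, so the example runs without network access.

```r
library(rvest)

# Hypothetical helper (not part of rvest): find tables hidden inside HTML comments.
extract_commented_tables <- function(page) {
  # Text content of every comment node in the document
  comments <- html_text(html_nodes(page, xpath = "//comment()"))
  # Keep only the comments that contain table markup
  comments <- comments[grepl("<table", comments, fixed = TRUE)]
  # Re-parse each comment as HTML and extract the tables it contains
  tables <- lapply(comments, function(txt) {
    html_table(html_nodes(read_html(txt), "table"))
  })
  if (length(tables) == 0) list() else unlist(tables, recursive = FALSE)
}

# Made-up sample markup that mimics a commented-out table (placeholder values)
demo_html <- '<html><body>
<div id="div_four_factors">
<!--
<table id="four_factors">
<thead><tr><th>Team</th><th>Pace</th></tr></thead>
<tbody>
<tr><td>AAA</td><td>100</td></tr>
<tr><td>BBB</td><td>100</td></tr>
</tbody>
</table>
-->
</div>
</body></html>'

page <- read_html(demo_html)

# A plain table selector does not match the commented-out table
visible <- html_nodes(page, "table")

# The helper recovers it from the comment text
hidden <- extract_commented_tables(page)
four_factors <- hidden[[1]]
```

On the real page, one would pass webpage from the question instead of page and then pick out the recovered table that holds the team names. Whether the live markup still matches this pattern is not verified here.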