
Accessing an HTML table with rvest

html css r


I want to scrape some NBA data. Here is what I have so far, and it works fine:

install.packages('rvest')
library(rvest)

url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)

away = data[[1]]
home = data[[3]]

colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]

away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]

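The problem is that these tables do not include the team names, which are important. To get that information, I thought I would scrape the Four Factors table from the page; however, rvest does not seem to recognize it as a table. The div containing the Four Factors table is: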

<div class="overthrow table_container" id="div_four_factors">

<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">

But trying to select it does not seem to work, since all I get back is an empty list. How can I access the Four Factors table?

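I am by no means an HTML expert, but the table you are interested in appears to be commented out in the page source, and the comment is then overwritten at some point before the page is rendered. Since rvest only sees the source as delivered, the table is not there as a node to select.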
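If the table really is sitting inside an HTML comment, one common workaround is to pull out the comment nodes, re-parse their text as HTML, and select the table from that. Below is a minimal sketch on a toy page, so it runs offline. The team codes and numbers are placeholders, not real data, and whether the live page has exactly this layout is an assumption.

```r
library(rvest)
library(xml2)

# Toy stand-in for the box score page: the table exists only inside a comment.
# Team codes and values are placeholders, not real statistics.
page <- read_html('
<html><body>
<div id="all_four_factors">
<!--
<div class="overthrow table_container" id="div_four_factors">
<table id="four_factors">
<thead><tr><th>Team</th><th>Pace</th></tr></thead>
<tbody>
<tr><td>AWY</td><td>1</td></tr>
<tr><td>HOM</td><td>2</td></tr>
</tbody>
</table>
</div>
-->
</div>
</body></html>')

# Selecting the table directly finds nothing, because it is not a real node.
direct <- html_nodes(page, "#four_factors")

# Collect every comment node and keep the one that mentions the table id.
comment_text <- xml_text(xml_find_all(page, "//comment()"))
hit <- comment_text[grepl('id="four_factors"', comment_text, fixed = TRUE)]

# Re-parse the comment body as HTML and read the table from it.
four_factors <- html_table(html_node(read_html(hit[1]), "#four_factors"))
```

Against the real page, the same steps would be applied to `webpage` in place of `page`, assuming the table is still delivered inside a comment.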

If we assume the home team is always listed second, we can rely on position and scrape a different element on the page instead:

table = html_nodes(webpage,'#bottom_nav_container')
teams <- html_text(table[1]) %>%
  stringr::str_split("Schedule\n")

away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])

Obviously, this is not the cleanest solution, but such is life in the world of web scraping.
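To see why the first two pieces of the split are taken as the team names, the same split can be tried on a toy string. The text below is a hypothetical stand-in for `html_text(table[1])`; the real contents depend on the page and are not reproduced here.

```r
library(stringr)

# Hypothetical navigation text: each team name is followed by a "Schedule" link.
nav_text <- "  Away Team Name\nSchedule\n  Home Team Name\nSchedule\n"

# Splitting on the link text leaves one piece per team, plus a trailing empty piece.
teams <- str_split(nav_text, "Schedule\n")

# Position 1 is taken as the away team, position 2 as the home team.
away_team <- trimws(teams[[1]][1])
home_team <- trimws(teams[[1]][2])
```

If the page ever lists the teams in a different order or changes the link text, the positions would silently point at the wrong pieces, so it is worth checking `teams[[1]]` by eye on a few games.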
