R web scraping with rvest: not sure how to proceed
As a side project, I'm trying to collect statistics for NFL players that are relevant to fantasy football. I found a URL that contains the data I need, and I'm trying to scrape it into R, but I'm not having much luck. I have tried many things; the closest I've come is:
Test1 <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/") %>% html_nodes('.TableBase-bodyTr')
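This returns pure chaos with the relevant information buried inside it. I also tried calling html_table() on it, but that only produced an error.
Now, if I use the View function on "Test1", I can drill down through many layers of data and find what I'm looking for, but what I'm trying to figure out is how to get at that data directly.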
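I really don't know what to do next. If anyone could give me some advice, I'd be very grateful. My familiarity with HTML is very low. I'm trying to read more about it and understand it, but from what I gathered by looking at the page, the data is stored in the class "TableBase-bodyTr", which is why I pointed the node selector at that class.
The table is formatted somewhat unusually, which is what causes the error in html_table(). I'm not sure how to correct that.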
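One possible contributing factor, offered as an assumption rather than something verified against the live page: html_table() is designed to parse `<table>` elements, whereas the selector '.TableBase-bodyTr' returns individual `<tr>` rows. A minimal sketch on a made-up inline snippet (the markup and player names are hypothetical, and the real page's nested player cell would likely still need cleanup afterwards):

```r
library(rvest)

# Hypothetical, simplified markup; the real page is more deeply nested.
html <- '<table class="TableBase-table">
  <tr><th>Player</th><th>GP</th><th>YDS</th></tr>
  <tr class="TableBase-bodyTr"><td>A. Player</td><td>16</td><td>4000</td></tr>
  <tr class="TableBase-bodyTr"><td>B. Player</td><td>15</td><td>3500</td></tr>
</table>'

page <- read_html(html)

# Select the <table> element itself rather than its rows, then parse it.
tbl <- page %>% html_node("table") %>% html_table()
print(tbl)
```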
Here is an alternative approach that scrapes the contents of the rows and then builds a data frame from them:
library(rvest)

page <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/")

# find the rows of the table
rows <- page %>% html_nodes('tr')

# the first 2 rows are header information, so skip them
# get the player name (both the short and the long version)
playername <- rows[-c(1, 2)] %>% html_nodes('td span span a') %>% html_text() %>% trimws()
# each player contributes two names, so fold the vector into two columns
playername <- matrix(playername, ncol = 2, byrow = TRUE)

# get the position and the team
position <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-position') %>% html_text() %>% trimws()
team <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-team') %>% html_text() %>% trimws()

# get the stats from the table: 16 cells per row, filled row by row
cols <- rows[-c(1, 2)] %>% html_nodes('td') %>% html_text() %>% trimws()
stats <- matrix(cols, ncol = 16, byrow = TRUE)

# build the final answer; the first stats column is the messy player cell, so drop it
answer <- data.frame(playername, position, team, stats[, -1])

# rename the columns (4 identifier columns + 15 stat columns)
# note: ATT, YDS and TD each appear twice (presumably passing, then rushing), so the names are not unique
statnames <- c("Name_s", "Name_l", "position", "team", 'GP', 'ATT', 'CMP', 'YDS', 'YDS/G', "TD", 'INT', 'RATE', 'ATT', 'YDS', 'AVG', 'TD', 'FL', 'FPTS', "FPPG")
names(answer) <- statnames
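The flatten-then-reshape idea used above can be tried offline on a small inline table. This is only a sketch: the markup, the class names `name` and `team`, and the players are invented for illustration, and the toy table has a single header row and three cells per row rather than the two header rows and sixteen cells assumed for the real page.

```r
library(rvest)

# Hypothetical, simplified markup standing in for the real page.
html <- '<table>
  <tr><th>Player</th><th>GP</th><th>YDS</th></tr>
  <tr><td><span class="name">A. Player</span> <span class="team">AAA</span></td><td>16</td><td>4000</td></tr>
  <tr><td><span class="name">B. Player</span> <span class="team">BBB</span></td><td>15</td><td>3500</td></tr>
</table>'

rows <- read_html(html) %>% html_nodes("tr")
body <- rows[-1]  # drop the single header row of this toy table

# Flatten every cell into one character vector, row after row...
cells <- body %>% html_nodes("td") %>% html_text() %>% trimws()

# ...then fold it back into one row per player (3 cells per row here).
stats <- matrix(cells, ncol = 3, byrow = TRUE)

# Pull the nested spans separately, as is done for position and team above.
name <- body %>% html_nodes("span.name") %>% html_text() %>% trimws()
team <- body %>% html_nodes("span.team") %>% html_text() %>% trimws()

# Drop the messy first column and convert the remaining text to numbers.
result <- data.frame(name, team, stats[, -1], stringsAsFactors = FALSE)
names(result) <- c("name", "team", "GP", "YDS")
result[c("GP", "YDS")] <- lapply(result[c("GP", "YDS")], as.numeric)
print(result)
```

The reshape only works when every row has the same number of cells, so it is worth checking that `length(cells)` is a multiple of `ncol` before building the matrix.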
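Thank you! I'm going through the code and trying to understand all of it. There is still some messy data that needs further work. Thanks for your effort, I really appreciate it. HTML really frustrates me; I'm never sure which node to look at.
@MSCRN, yes, this page is not an easy one. Hopefully the comments in the code above provide enough guidance to get you to the final result.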
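On the question of which node to look at, one way to explore is to take a single row and list the tag name and class of the elements inside it. A minimal sketch, assuming a row shaped roughly like the ones on the page (the snippet below is hand-written, not copied from the site; only the two `CellPlayerName-*` class names come from the code above):

```r
library(rvest)

# Hand-written stand-in for one table row.
html <- '<table><tr class="TableBase-bodyTr">
  <td><span class="CellPlayerName-position">QB</span>
      <span class="CellPlayerName-team">IND</span></td>
  <td>1</td>
</tr></table>'

row <- read_html(html) %>% html_node("tr")

# List every <span> under the row with its class and text,
# which shows which selectors are available to target.
spans <- row %>% html_nodes("span")
overview <- data.frame(
  tag   = html_name(spans),
  class = html_attr(spans, "class"),
  text  = trimws(html_text(spans)),
  stringsAsFactors = FALSE
)
print(overview)
```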
[65] "\n \n \n \n \n \n J. Eason\n \n \n \n QB\n \n \n \n IND\n \n \n \n \n \n \n \n