Scraping into R returns gibberish, not sure how to proceed


r web-scraping


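As a side project, I am trying to collect statistics on NFL players that are relevant to fantasy football. I found a URL that contains the data I need, and I am trying to scrape it into R, but I am not having any luck. I have tried a lot of things, and the closest I have gotten is: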

Test1 <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/") %>% html_nodes('.TableBase-bodyTr')

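The result is pure chaos with the relevant information embedded somewhere inside it. I also tried calling html_table() on it, but that just gave me an error.

If I use the View function on "Test1", I can drill down through many layers of the object and find what I am looking for, but what I am trying to figure out is how to get at that data directly.

I really don't know where to go next, and I would be very grateful for any advice. My familiarity with HTML is very low. I am trying to read up on it and understand it, but from what I could gather by inspecting the page, the data is stored in the class "TableBase-bodyTr", which is why I pointed the node selector at that class.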
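For readers as new to HTML as I am: a class name is matched in a CSS selector by putting a dot in front of it, so rows marked class="TableBase-bodyTr" are selected with '.TableBase-bodyTr'. Here is a minimal sketch on a made-up inline snippet. The markup, player names, and numbers are invented for illustration and are not the real page. It shows that html_nodes() hands back whole row nodes, so the cell text has to be pulled out of each row in a second step.

```r
library(rvest)

# Hypothetical, simplified stand-in for the real page; markup and values are invented.
snippet <- '
<table>
  <tr class="TableBase-bodyTr"><td>Player A</td><td>16</td><td>4000</td></tr>
  <tr class="TableBase-bodyTr"><td>Player B</td><td>15</td><td>3500</td></tr>
</table>'

doc <- read_html(snippet)

# Select every element carrying the class; each result is a whole <tr> node.
row_nodes <- doc %>% html_nodes(".TableBase-bodyTr")

# Pull the text of each cell, giving one character vector per row.
cells_by_row <- lapply(row_nodes, function(row) {
  row %>% html_nodes("td") %>% html_text(trim = TRUE)
})

# Stack the rows into a data frame (every column is still character here).
demo <- as.data.frame(do.call(rbind, cells_by_row), stringsAsFactors = FALSE)
names(demo) <- c("player", "gp", "yds")
print(demo)
```

On the real page the cells contain nested spans, so the per-row text will likely need more cleanup than this toy example suggests.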

The table is formatted a bit oddly, which causes html_table() to raise an error. I don't know how to correct that.

Here is an alternative approach that scrapes the contents of the rows and then builds a data frame from them:

library(rvest)
page <- read_html("https://www.cbssports.com/fantasy/football/stats/QB/2020/season/projections/ppr/") 

#find the rows of the table
rows <- page %>% html_nodes('tr')

#the first 2 rows are header information, so skip them
#get the player name (both the short and the long version)
playername <- rows[-c(1, 2)] %>% html_nodes('td span span a') %>% html_text() %>% trimws() 
playername <- matrix(playername, ncol=2, byrow=TRUE)

#get the team and position
position <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-position') %>% html_text() %>% trimws() 
team <- rows[-c(1, 2)] %>% html_nodes('span.CellPlayerName-team') %>% html_text() %>% trimws() 

#get the stats from the table
cols <- rows[-c(1, 2)] %>% html_nodes('td') %>% html_text() %>% trimws() 
stats <- matrix(cols, ncol=16, byrow=TRUE)

#make the final answer
answer <- data.frame(playername, position, team, stats[, -1])
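#rename the columns; ATT, YDS and TD each appear twice in the header
#(presumably passing first, then rushing), so prefix them to keep the names unique
statnames <- c("Name_s", "Name_l", "position", "team", "GP", "Pass_ATT", "CMP", "Pass_YDS", "YDS/G", "Pass_TD", "INT", "RATE", "Rush_ATT", "Rush_YDS", "AVG", "Rush_TD", "FL", "FPTS", "FPPG")
names(answer) <- statnames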
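
The key step above is that html_text() returns one flat character vector of cells, which matrix(..., byrow=TRUE) folds back into rows. Below is a small base R sketch of that step on invented values. The helper name cells_to_frame and the sample numbers are hypothetical. It adds a length check, so a changed page layout fails loudly instead of quietly misaligning the columns, and it converts the character columns to numbers where possible.

```r
# Hypothetical helper: fold a flat vector of cell text into a data frame.
cells_to_frame <- function(cells, col_names) {
  n_col <- length(col_names)
  # matrix() does not reliably stop on a bad length, so check explicitly.
  if (length(cells) %% n_col != 0) {
    stop("number of cells (", length(cells), ") is not a multiple of ", n_col)
  }
  m <- matrix(cells, ncol = n_col, byrow = TRUE)
  df <- as.data.frame(m, stringsAsFactors = FALSE)
  names(df) <- col_names
  # Convert columns that look numeric; text columns stay character.
  df[] <- lapply(df, function(col) type.convert(col, as.is = TRUE))
  df
}

# Invented sample cells: three columns per row, read row by row.
cells <- c("Player A", "16", "4000.5",
           "Player B", "15", "3500.0")
demo_stats <- cells_to_frame(cells, c("player", "gp", "yds"))
str(demo_stats)
```

The same idea should carry over to the scraped stats matrix, though values on the real page may need extra cleaning before they convert to numbers.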

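Thank you! I am going through the code and trying to understand all of it. There is still some odd-looking data in there that needs further work. Thanks for the effort, I really appreciate it. HTML really trips me up; I am never sure which node to look at.

@MSCRN, yes, this page is not an easy one. Hopefully the comments in the code above give enough guidance to get to the final result.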
[65] "\n                    \n                        \n                        \n            \n                                                                                                    \n            J. Eason\n    \n                                        \n                                    \n                        QB\n                    \n                    \n                                    \n                        IND\n                    \n                                \n                \n                \n                            \n        \n        \n            
