R: scraping a page that uses a dropdown menu
I am trying to scrape the NBA daily ROS projections from a website. The problem is that the number of players shown defaults to 200, and I want 400 or ALL. This code retrieves the first 200 without any problem:
> url <- 'https://hashtagbasketball.com/fantasy-basketball-projections'
>
> page <- read_html(url)
>
> projs <- html_table(page)[[3]] %>% ### anything after this just cleans the df
+ rename_all(~gsub('3pm','threes',gsub('\\%','pct',tolower(.)))) %>%
+ mutate_at(vars(matches('pct$')),~stringr::str_sub(.,1,4)) %>%
+ mutate(player = stringr::word(player,1, 2, sep = ' ')) %>%
+ mutate(pos = stringr::word(pos,1,1,sep = ',')) %>%
+ mutate(pos2 = gsub('P','',pos)) %>%
+ drop_na(player) %>%
+ mutate_at(vars(-c(player,matches('pos'),team)),~as.numeric(.)) %>%
+ select(player, matches('pos'),everything(),-`r#`) %>%
+ head(2)
> projs
player pos pos2 team gp mpg fgpct ftpct threes pts treb ast stl blk to total
1 James Harden PG G HOU 64 36.3 0.44 0.86 4.7 34.4 6.6 9.3 1.7 0.8 4.6 17.68
2 Anthony Davis PF F LAL 65 34.8 0.50 0.84 1.3 26.6 9.4 3.2 1.5 2.3 2.5 14.56
This creates a table with all the categories I want. However, when I use the code below, it does not pull all the stat categories, only gp and mpg:
> pgsession <- html_session(url)
> pgform <-html_form(pgsession)[[1]]
> filled_form <-set_values(pgform,
+ "ctl00$ContentPlaceHolder1$DDSHOW" = "400")
>
> d <- submit_form(session=pgsession, form=filled_form)
Submitting with '<unnamed>'
>
> y <- d %>%
+ html_nodes("table") %>%
+ .[[3]] %>%
+ html_table(header=TRUE) %>%
+ mutate(PLAYER = stringr::word(PLAYER,1, 2, sep = ' ')) %>%
+ head(2)
> y
R# PLAYER POS TEAM GP MPG TOTAL
1 1 James Harden PG,SG HOU 64 36.3 0.00
2 2 Anthony Davis PF,C LAL 65 34.8 0.00
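Any idea what I am doing wrong?
Thanks.
The problem seems to be that the checkboxes for the other variables are not checked when the form is submitted, so you have to set them manually. The code below shows how to get ftm and ftpct; I'll leave the rest to you: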
library(tidyverse)
library(rvest)
url <- 'https://hashtagbasketball.com/fantasy-basketball-projections'
pgsession <- html_session(url)
pgform <-html_form(pgsession)
pgform[[1]][[5]][["ctl00$ContentPlaceHolder1$CBFTM"]]$value <- "checked"
pgform[[1]][[5]][["ctl00$ContentPlaceHolder1$CBFTP"]]$value <- "checked"
filled_form <-set_values(pgform[[1]],"ctl00$ContentPlaceHolder1$DDSHOW" = "400")
d <- submit_form(session=pgsession, form=filled_form)
d %>%
html_nodes("table") %>%
.[[3]] %>%
html_table() %>%
rename_all(~gsub('3pm','threes',gsub('\\%','pct',tolower(.)))) %>%
mutate_at(vars(matches('pct$')),~stringr::str_sub(.,1,4)) %>%
mutate(player = stringr::word(player,1, 2, sep = ' ')) %>%
mutate(pos = stringr::word(pos,1,1,sep = ',')) %>%
mutate(pos2 = gsub('P','',pos)) %>%
drop_na(player) %>%
mutate_at(vars(-c(player,matches('pos'),team)),~as.numeric(.)) %>%
select(player, matches('pos'),everything(),-`r#`) %>%
head(2)
# player pos pos2 team gp mpg ftm ftpct total
#1 James Harden PG G HOU 64 36.3 10.4 0.86 10.95
#2 Devin Booker SG SG PHX 70 35.6 6.7 0.91 7.99
In case you didn't know, you can get the checkbox names by right-clicking a checkbox and choosing Inspect in Chrome.
Keep in mind that lapply returns its result rather than modifying a copy in the global environment, so you may have better luck with a for loop. set_values might also work, but I found some GitHub issues suggesting it may not work for checkboxes.
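To illustrate the lapply versus for-loop point, here is a minimal sketch in base R. It does not touch the website: `form` is a hypothetical stand-in that only mimics the nested-list shape of the object returned by html_form(), and the element name `fields` is an assumption. The two checkbox names are the ones used in the answer above.

```r
# Hypothetical stand-in for the nested list returned by html_form();
# only the shape matters here, the real object has more elements.
form <- list(fields = list(
  "ctl00$ContentPlaceHolder1$CBFTM" = list(type = "checkbox", value = ""),
  "ctl00$ContentPlaceHolder1$CBFTP" = list(type = "checkbox", value = "")
))

boxes <- c("ctl00$ContentPlaceHolder1$CBFTM",
           "ctl00$ContentPlaceHolder1$CBFTP")

# lapply() runs the function in its own environment: the assignment below
# changes a local copy of `form`, and the outer `form` is left untouched.
ignored <- lapply(boxes, function(nm) {
  form$fields[[nm]]$value <- "checked"
  form$fields[[nm]]
})
unchanged <- form$fields[[boxes[1]]]$value

# A for loop assigns in the calling environment, so the change persists.
for (nm in boxes) {
  form$fields[[nm]]$value <- "checked"
}
checked <- vapply(form$fields, function(f) f$value, character(1))
```

After running this, `unchanged` still holds the original empty value, while every entry of `checked` is "checked". With the real form object, the same loop would go over whichever checkbox names you collected from the page.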