R: Cleaning data scraped from the web

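I'm fairly new to R and have been working on a project (just for fun) to help me learn, and I've run into something I can't seem to find an answer to online. I'm trying to teach myself how to scrape data from websites, and I started with the code below, which retrieves some data from 247Sports.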

library(rvest)
library(stringr)

link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"

link.scrap <- read_html(link)
data <- 
  html_nodes(x   = link.scrap, 
             css = '#page-content > div.main-div.clearfix > section.list-page > section > section > ul.content-list.ri-list > li:nth-child(3)') %>%
  html_text(trim = TRUE) %>% 
  trimws()
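
When I look at data, it appears to be a character vector of length 1, with multiple list items stored as a single value. The problem I'm having is splitting these into their own columns. For example, when I run the code below, which I thought would split the data at ")" and then strip the whitespace from the two resulting values, I get a strange result: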

f<-strsplit(data,")")
str_trim(f)
[1] "c(\"Ray Lima  El Camino College (Torrance, CA\", \"         DT 6-3 310    0.8681      39 4 9       Enrolled   1/9/2017\")"
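
The odd-looking result has a simple cause: strsplit() returns a list with one element per input string, and str_trim() coerces that list to a single character string, which is why the output starts with c(. A minimal sketch of one way around it, using a shortened, hand-typed sample string in place of the scraped value (the sample and the choice of fixed = TRUE are assumptions for illustration), is to pull the character vector out of the list before trimming:

```r
# Shortened stand-in for the scraped value (an assumption, not real output)
raw <- "Ray Lima  El Camino College (Torrance, CA)         DT 6-3 310"

# fixed = TRUE treats ")" as a literal character rather than a regex
parts <- strsplit(raw, ")", fixed = TRUE)

# strsplit() always returns a list, even for a single input string
class(parts)

# Take the first list element (a character vector) and trim each piece
trimmed <- trimws(parts[[1]])
trimmed
```

Note that the split consumes the closing parenthesis, so it would have to be pasted back onto the first piece if it is needed.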

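I have modified a few things in your code:

  • Used more general CSS selectors, so that entire rows can be extracted

  • Collected the individual columns as vectors, and then built a data frame from them

Please take a look: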

library(rvest)
library(stringr)
library(tidyr)

link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"

link.scrap <- read_html(link)

names <- link.scrap %>% html_nodes('div.name') %>% html_text()

pos <- link.scrap %>% html_nodes('ul.metrics-list') %>% html_text() 

status <- link.scrap %>% html_nodes('div.right-content.right') %>% html_text() 

data <- data.frame(names, pos, status, stringsAsFactors = FALSE)

data <- data[-1,]

head(data)


> head(data)
                                                      names          pos                     status
2        Kamilo Tongamoa  Merced College (Merced, CA)        DT 6-5 320     Enrolled   8/24/2017   
3        Ray Lima  El Camino College (Torrance, CA)          DT 6-3 310      Enrolled   1/9/2017   
4  O'Rien Vance  George Washington (Cedar Rapids, IA)       OLB 6-3 235     Enrolled   6/12/2017   
5          Matt Leo  Arizona Western College (Yuma, AZ)     WDE 6-7 265     Enrolled   2/22/2017   
6            Keontae Jones  Colerain (Cincinnati, OH)         S 6-1 175     Enrolled   6/12/2017   
7      Cordarrius Bailey  Clarksdale (Clarksdale, MS)       WDE 6-4 210     Enrolled   6/12/2017   
> 
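
The pos and status columns above still hold several values each. One possible way to split them further is sketched below in base R, on two hand-typed rows that mimic the output above. The helper name split_ws and the exact whitespace in the sample strings are assumptions for illustration, not part of the original answer:

```r
# Two hand-typed rows mimicking the scraped data frame (assumed whitespace)
data <- data.frame(
  names  = c("Kamilo Tongamoa  Merced College (Merced, CA)",
             "Ray Lima  El Camino College (Torrance, CA)"),
  pos    = c("  DT 6-5 320 ", " DT 6-3 310 "),
  status = c(" Enrolled   8/24/2017 ", " Enrolled   1/9/2017 "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string on runs of whitespace and
# return the pieces as named data frame columns
split_ws <- function(x, into) {
  parts <- strsplit(trimws(x), "\\s+")
  out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(out) <- into
  out
}

tidy <- cbind(
  data["names"],
  split_ws(data$pos, c("pos", "ht", "wt")),
  split_ws(data$status, c("enrolled", "date"))
)

# Dates on the page are month/day/year
tidy$date <- as.Date(tidy$date, format = "%m/%d/%Y")
tidy
```

This only works when every row has the same number of whitespace-separated pieces; a row with a missing value would need separate handling.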

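The basic problem is that the web page contains what looks like a table, but it is actually a heavily styled list. This means you need to work through each element, pull out the relevant nodes, and process the node contents further as required.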
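
To make the "styled list" point concrete, here is a small offline sketch. The HTML string is a hypothetical, heavily simplified stand-in for the real page, and the per-item use of html_node() is one possible variation rather than the approach taken in the rest of this answer. Because html_node() returns exactly one result per list item, a field missing from one item becomes NA instead of shifting the other rows:

```r
library(rvest)

# Hypothetical, simplified markup; the real page has far more nesting.
# The second item deliberately has no score.
html <- '<ul class="content-list ri-list">
  <li><a class="name">Kamilo Tongamoa</a>
      <span class="meta"> Merced College (Merced, CA) </span>
      <span class="score">0.8742</span></li>
  <li><a class="name">Ray Lima</a>
      <span class="meta"> El Camino College (Torrance, CA) </span></li>
</ul>'

# One node per list item
items <- read_html(html) %>% html_nodes("ul.content-list.ri-list > li")

# html_node() (singular) yields one match, or a missing node, per item
players <- data.frame(
  name    = items %>% html_node("a.name") %>% html_text(),
  college = items %>% html_node("span.meta") %>% html_text() %>% trimws(),
  score   = items %>% html_node("span.score") %>% html_text(),
  stringsAsFactors = FALSE
)
players
```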

First, grab the whole list:

library(dplyr)
library(rvest)

iowa_state <- read_html("https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank") %>%
  html_nodes('ul.content-list.ri-list')

Extract the metrics (position, height, weight). This creates a vector in which the first 3 elements are the headers (Pos, Ht, Wt), followed by each player's metrics, 3 elements at a time:

metrics <- iowa_state %>% 
  html_nodes("ul.metrics-list li") %>% 
  html_text() %>% 
  trimws()
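
As a side note, a flat vector laid out this way can also be reshaped in one step. The sketch below uses a short hand-typed vector in place of the scraped one (the sample values are assumptions) and checks that the result agrees with seq()-based indexing of every third element:

```r
# Hand-typed stand-in: 3 headers, then 3 values per player
metrics <- c("Pos", "Ht", "Wt",
             "DT", "6-5", "320",
             "DT", "6-3", "310")

headers <- tolower(metrics[1:3])
values  <- metrics[-(1:3)]

# Guard against a partial row before reshaping
stopifnot(length(values) %% 3 == 0)

# Fill row by row so each player becomes one row
m <- matrix(values, ncol = 3, byrow = TRUE,
            dimnames = list(NULL, headers))
metrics_df <- as.data.frame(m, stringsAsFactors = FALSE)

# Same result as picking every third element starting at position 4
same_pos <- identical(metrics_df$pos,
                      metrics[seq(4, length(metrics) - 2, 3)])
metrics_df
```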

Extract the status ("Enrolled" and the date). This creates a vector in which "Enrolled" fills elements 1, 3, 5, ... and the dates fill elements 2, 4, 6, ...:

status <- iowa_state %>% 
  html_nodes("p.commit-date") %>% 
  html_text() %>% 
  trimws()

Now we can build the data frame (or tibble) column by column:

iowa_state_df <- tibble(name     = iowa_state %>% html_nodes("a.name") %>% html_text(),
                        college  = iowa_state %>% html_nodes("span.meta") %>% html_text() %>% trimws(),
                        pos      = metrics[seq(4, length(metrics)-2, 3)],
                        ht       = metrics[seq(5, length(metrics)-1, 3)],
                        wt       = metrics[seq(6, length(metrics), 3)],
                        score    = iowa_state %>% html_nodes("span.score") %>% html_text(),
                        natrank  = iowa_state %>% html_nodes("div.rank a.natrank") %>% html_text(),
                        posrank  = iowa_state %>% html_nodes("div.rank a.posrank") %>% html_text(),
                        sttrank  = iowa_state %>% html_nodes("div.rank a.sttrank") %>% html_text(),
                        enrolled = status[seq(1, length(status)-1, 2)],
                        date     = status[seq(2, length(status), 2)]
)

glimpse(iowa_state_df)

Observations: 26
Variables: 11
$ name     <chr> "Kamilo Tongamoa", "Ray Lima", "O'Rien Vance", "Matt Leo", "Keontae Jones", "Cordarriu...
$ college  <chr> "Merced College (Merced, CA)", "El Camino College (Torrance, CA)", "George Washington ...
$ pos      <chr> "DT", "DT", "OLB", "WDE", "S", "WDE", "WR", "CB", "CB", "DUAL", "SDE", "OT", "OT", "WR...
$ ht       <chr> "6-5", "6-3", "6-3", "6-7", "6-1", "6-4", "5-11", "6-1", "6-0.5", "6-4", "6-3", "6-5",...
$ wt       <chr> "320", "310", "235", "265", "175", "210", "170", "190", "170", "221", "250", "260", "3...
$ score    <chr> "0.8742", "0.8681", "0.8681", "0.8656", "0.8624", "0.8546", "0.8515", "0.8482", "0.847...
$ natrank  <chr> "28", "39", "508", "48", "587", "724", "806", "885", "924", "928", "929", "NA", "NA", ...
$ posrank  <chr> "3", "4", "29", "5", "42", "42", "117", "91", "100", "19", "42", "88", "90", "12", "57...
$ sttrank  <chr> "5", "9", "4", "7", "25", "13", "9", "124", "20", "8", "6", "10", "24", "37", "20", "1...
$ enrolled <chr> "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "E...
$ date     <chr> "8/24/2017", "1/9/2017", "6/12/2017", "2/22/2017", "6/12/2017", "6/12/2017", "6/12/201...

You can then convert the columns to the appropriate types (date, numeric, etc.) as required.
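
A minimal sketch of that conversion step, using a two-row hand-typed data frame in place of iowa_state_df (the sample rows, the ht_in column name, and the choice to express height in inches are all assumptions for illustration):

```r
# Hand-typed stand-in for a few columns of the scraped tibble
df <- data.frame(
  ht      = c("6-5", "6-0.5"),
  wt      = c("320", "170"),
  score   = c("0.8742", "0.8470"),
  natrank = c("28", "NA"),
  date    = c("8/24/2017", "1/9/2017"),
  stringsAsFactors = FALSE
)

df$wt      <- as.integer(df$wt)
df$score   <- as.numeric(df$score)
# The literal string "NA" becomes a missing value
df$natrank <- as.integer(df$natrank)
# Dates on the page are month/day/year
df$date    <- as.Date(df$date, format = "%m/%d/%Y")

# Height is "feet-inches"; convert to total inches
ht_parts <- strsplit(df$ht, "-", fixed = TRUE)
df$ht_in <- vapply(
  ht_parts,
  function(p) as.numeric(p[1]) * 12 + as.numeric(p[2]),
  numeric(1)
)
df
```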
