Python: scraping box scores for the home team
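Tags: python, html, web-scraping, beautifulsoup
I am working on a project in which I want to scrape NBA game statistics for the 2019/20 season, from October through August.
I only care about game-level results for the home and away teams, not individual player or team statistics, so I need the box score data for each game from the "Basic Box Score Stats" tables.
Problem: when scraping the box scores, I only manage to collect the away team's data. Its table is the first one on the box-score page, so I can select it with index [0], which never changes. For the home team, the table index seems to change depending on whether the game went to overtime (OT), and sometimes for other reasons I have not identified, so it is somewhat dynamic.
Question: what is the best way to loop over each month and collect the box scores for both the away and home teams? Alternatively, how can I collect the home team's data for each box score?
Example box-score page for a game without overtime:
Example box-score page for a game with overtime:
In the latter example, the home team's table index shifts according to the number of preceding tables (tables containing data, such as the overtime tables). It is usually the eighth table when there is no overtime and a different one when there is.
The code that successfully (and consistently) gets the away team's data is below: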
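import pandas as pd
# A list of box-score URLs; iterating over a bare string would yield single characters.
box_score_example_url = ['http://www.basketball-reference.com//boxscores/201910230POR.html']
dfbox = []
for eachBox in box_score_example_url:
    dfz = pd.read_html(eachBox)
    dfbox.append(dfz[0])  # index 0 is the away team's basic box score
boxbox_awayteam = pd.concat(dfbox)
boxbox_awayteam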
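I am out of ideas, because none of the tables in the HTML seem to have a specific id or class. This is my first web-scraping project and my first question on Stack Overflow, so please bear with me.
You can use BeautifulSoup and the CSS selector
[id$="-game-basic"] table
to select only the two basic tables, then load them with pd.read_html():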
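A minimal sketch of this approach for a single page follows. The `SAMPLE_HTML` string is a made-up, heavily trimmed stand-in for a box-score page: it assumes that each team's full-game table sits inside an element whose id ends in `-game-basic`, while overtime tables use a different suffix. `select_basic_tables` and `tables_to_frames` are hypothetical helper names, not part of any library.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up, trimmed stand-in for a box-score page (an assumption, not real markup).
SAMPLE_HTML = """
<div id="all_box-NOP-game-basic"><table id="box-NOP-game-basic"><tr><td>away</td></tr></table></div>
<div id="all_box-NOP-ot1-basic"><table id="box-NOP-ot1-basic"><tr><td>away OT</td></tr></table></div>
<div id="all_box-TOR-game-basic"><table id="box-TOR-game-basic"><tr><td>home</td></tr></table></div>
<div id="all_box-TOR-ot1-basic"><table id="box-TOR-ot1-basic"><tr><td>home OT</td></tr></table></div>
"""


def select_basic_tables(html):
    """Return the <table> elements nested under ids ending in '-game-basic'."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select('[id$="-game-basic"] table')


def tables_to_frames(tables):
    """Parse each selected table with pandas and drop the top header level."""
    frames = []
    for table in tables:
        df = pd.read_html(StringIO(str(table)))[0]
        if isinstance(df.columns, pd.MultiIndex):
            df = df.droplevel(0, axis=1)
        frames.append(df)
    return frames


# The overtime tables are skipped, so the position of the home table no longer matters.
tables = select_basic_tables(SAMPLE_HTML)
table_ids = [t["id"] for t in tables]
print(table_ids)
```

On a real page, the bytes returned by `requests.get(url).content` would be passed to `select_basic_tables` and the result to `tables_to_frames`. Judging by the output in this answer, the first frame should then be the away team and the second the home team.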
Prints:
Starters MP ... PTS +/-
0 Jrue Holiday 41:05 ... 13 -14
1 Brandon Ingram 35:06 ... 22 -19
2 J.J. Redick 27:03 ... 16 -14
3 Lonzo Ball 24:50 ... 8 -7
4 Derrick Favors 20:46 ... 6 -12
5 Reserves MP ... PTS +/-
6 Josh Hart 28:10 ... 15 -1
7 Nicolò Melli 19:37 ... 14 +11
8 Kenrich Williams 18:02 ... 3 +11
9 Frank Jackson 13:51 ... 9 +7
10 Jahlil Okafor 12:29 ... 8 -7
11 E'Twaun Moore 12:06 ... 5 -1
12 Nickeil Alexander-Walker 11:55 ... 3 +6
13 Jaxson Hayes Did Not Play ... Did Not Play Did Not Play
14 Team Totals 265 ... 122 NaN
[15 rows x 21 columns]
Starters MP ... PTS +/-
0 Kyle Lowry 44:59 ... 22 -1
1 Fred VanVleet 44:21 ... 34 +18
2 Pascal Siakam 38:09 ... 34 +5
3 OG Anunoby 35:48 ... 11 +12
4 Marc Gasol 31:55 ... 6 -2
5 Reserves MP ... PTS +/-
6 Norman Powell 28:38 ... 5 +2
7 Serge Ibaka 26:00 ... 13 +6
8 Terence Davis 15:10 ... 5 0
9 Matt Thomas Did Not Play ... Did Not Play Did Not Play
10 Chris Boucher Did Not Play ... Did Not Play Did Not Play
11 Stanley Johnson Did Not Play ... Did Not Play Did Not Play
12 Malcolm Miller Did Not Play ... Did Not Play Did Not Play
13 Dewan Hernandez Did Not Play ... Did Not Play Did Not Play
14 Team Totals 265 ... 130 NaN
[15 rows x 21 columns]
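Since only team-level results are needed, the `Team Totals` row can be pulled out of each frame. This is a minimal sketch, assuming frames shaped like the output above (a `Starters` column holding the row labels and a final `Team Totals` row). `team_totals` is a hypothetical helper, and the small frame is hand-built from three of the printed rows rather than scraped.

```python
import pandas as pd


def team_totals(df):
    """Return the 'Team Totals' row of a box-score frame as a Series."""
    mask = df["Starters"] == "Team Totals"
    return df.loc[mask].iloc[0]


# Hand-built frame using values from the printed output (not scraped).
# Values are strings because the repeated 'Reserves' header row keeps the
# columns from being parsed as numbers.
away = pd.DataFrame(
    {
        "Starters": ["Jrue Holiday", "Reserves", "Team Totals"],
        "MP": ["41:05", "MP", "265"],
        "PTS": ["13", "PTS", "122"],
    }
)

totals = team_totals(away)
away_points = int(totals["PTS"])
print(away_points)
```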
EDIT: To put this function in a loop, you can use the following example:
from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2020_games.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
def get_tables(url):
    # Return the (away, home) basic box-score tables for one game page.
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    my_tables = soup.select('[id$="-game-basic"] table')
    # Wrap the HTML in StringIO: newer pandas versions deprecate passing literal HTML strings.
    df_1 = pd.read_html(StringIO(str(my_tables[0])))[0].droplevel(0, axis=1)
    df_2 = pd.read_html(StringIO(str(my_tables[1])))[0].droplevel(0, axis=1)
    return df_1, df_2
# Follow each month link in the schedule filter, then every box-score link on that month's page.
for a in soup.select('.filter a'):
    u = 'https://www.basketball-reference.com' + a['href']
    print(u)
    soup2 = BeautifulSoup(requests.get(u).content, 'html.parser')
    for a2 in soup2.select('td a[href^="/boxscores/"]'):
        u2 = 'https://www.basketball-reference.com' + a2['href']
        t1, t2 = get_tables(u2)
        print(u2)
        print(t1)
        print(t2)
        print('-' * 80)
Prints:
https://www.basketball-reference.com/leagues/NBA_2020_games-october.html
https://www.basketball-reference.com/boxscores/201910220TOR.html
Starters MP ... PTS +/-
0 Jrue Holiday 41:05 ... 13 -14
1 Brandon Ingram 35:06 ... 22 -19
2 J.J. Redick 27:03 ... 16 -14
3 Lonzo Ball 24:50 ... 8 -7
4 Derrick Favors 20:46 ... 6 -12
5 Reserves MP ... PTS +/-
6 Josh Hart 28:10 ... 15 -1
7 Nicolò Melli 19:37 ... 14 +11
8 Kenrich Williams 18:02 ... 3 +11
9 Frank Jackson 13:51 ... 9 +7
10 Jahlil Okafor 12:29 ... 8 -7
11 E'Twaun Moore 12:06 ... 5 -1
12 Nickeil Alexander-Walker 11:55 ... 3 +6
13 Jaxson Hayes Did Not Play ... Did Not Play Did Not Play
14 Team Totals 265 ... 122 NaN
[15 rows x 21 columns]
Starters MP ... PTS +/-
0 Kyle Lowry 44:59 ... 22 -1
1 Fred VanVleet 44:21 ... 34 +18
2 Pascal Siakam 38:09 ... 34 +5
3 OG Anunoby 35:48 ... 11 +12
4 Marc Gasol 31:55 ... 6 -2
5 Reserves MP ... PTS +/-
6 Norman Powell 28:38 ... 5 +2
7 Serge Ibaka 26:00 ... 13 +6
8 Terence Davis 15:10 ... 5 0
9 Matt Thomas Did Not Play ... Did Not Play Did Not Play
10 Chris Boucher Did Not Play ... Did Not Play Did Not Play
11 Stanley Johnson Did Not Play ... Did Not Play Did Not Play
12 Malcolm Miller Did Not Play ... Did Not Play Did Not Play
13 Dewan Hernandez Did Not Play ... Did Not Play Did Not Play
14 Team Totals 265 ... 130 NaN
[15 rows x 21 columns]
--------------------------------------------------------------------------------
https://www.basketball-reference.com/boxscores/201910220LAC.html
Starters MP ... PTS +/-
0 Anthony Davis 37:22 ... 25 +3
1 LeBron James 36:00 ... 18 -8
2 Danny Green 32:20 ... 28 +7
...and so on.
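To collect the frames instead of printing them, each one can be tagged with its game URL and side before concatenation. This is a minimal sketch, assuming the first frame returned for a game is the away team and the second the home team, as in the output above. `combine_games` and the `game_url` and `side` column names are illustrative choices, and the tiny frames are hand-built rather than scraped.

```python
import pandas as pd


def combine_games(games):
    """Stack (url, away_df, home_df) triples into one long DataFrame."""
    parts = []
    for url, away_df, home_df in games:
        for side, df in (("away", away_df), ("home", home_df)):
            # assign() returns a copy, so the input frames stay untouched.
            parts.append(df.assign(game_url=url, side=side))
    return pd.concat(parts, ignore_index=True)


url = "https://www.basketball-reference.com/boxscores/201910220TOR.html"
away = pd.DataFrame({"Starters": ["Team Totals"], "PTS": ["122"]})
home = pd.DataFrame({"Starters": ["Team Totals"], "PTS": ["130"]})

all_rows = combine_games([(url, away, home)])
print(all_rows[["side", "PTS"]])
```

Inside the loop, the `(u2, t1, t2)` triples could be appended to a list and passed to `combine_games` once the loop finishes.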
Thanks a lot! This is a huge step forward from where I was. Could you show me how to write a loop that iterates through all the links in a given month? I have a list containing the links to all the box scores for August; how do I apply your suggested code in a loop?
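If the box-score links are already in a list, the per-page function can be called on each one directly. This is a minimal sketch, assuming a `get_tables(url)` function like the one in the answer is passed in. `scrape_box_scores`, the `delay_seconds` pause between requests, and the skip-on-error behaviour are additions for illustration, not part of the original answer.

```python
import time


def scrape_box_scores(urls, get_tables, delay_seconds=3.0):
    """Call get_tables on every URL and return {url: (away_df, home_df)}."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between requests to go easy on the site
        try:
            results[url] = get_tables(url)
        except Exception as exc:  # e.g. a network error or an unexpected page layout
            print(f"skipping {url}: {exc}")
    return results


# Offline demonstration with a stand-in for get_tables (no network access).
def fake_get_tables(url):
    if "bad" in url:
        raise ValueError("no basic tables found")
    return ("away:" + url, "home:" + url)


demo = scrape_box_scores(["game1", "bad-game", "game2"], fake_get_tables, delay_seconds=0)
print(sorted(demo))
```

With the real function, the call would look like `scrape_box_scores(august_links, get_tables)`, where `august_links` stands for the list of August box-score URLs mentioned in the comment.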