Turning BeautifulSoup data into a DataFrame in Python
I am new to Python. I am scraping data from the web and the scraping itself works, but I am having trouble putting the data into a DataFrame. It seems I can only get one row of data into the DataFrame.
n=range(2009,2021)
url2 ='https://www.sports-reference.com/cbb/seasons/'
url3 ='-school-stats.html'
for n in n:
    all = url2+str(n)+url3
    r = requests.get(all)
    soup = BeautifulSoup(r.text, 'html.parser')
    league_table = soup.find('table', class_ = 'per_match_toggle sortable stats_table')
    for team in league_table.find_all('tbody'):
        rows = team.find_all('tr')
        for row in rows:
            pl_team = row.find('td', class_ = 'left')
            if pl_team == (None):
                continue
            pl_wins = row.find_all('td', class_ = 'right')[1]
            if pl_wins == (None):
                continue
            pl_loses = row.find_all('td', class_ = 'right')[2]
            if pl_wins == (None):
                continue
            pl_total_points = row.find_all('td', class_ = 'right')[16]
            if pl_total_points == (None):
                continue
            pl_total_points_againest = row.find_all('td', class_ = 'right')[17]
            if pl_total_points_againest == (None):
                continue
            pl_FG_percentage = row.find_all('td', class_ = 'right')[22]
            if pl_FG_percentage == (None):
                continue
            pl_3_percentage = row.find_all('td', class_ = 'right')[25]
            if pl_3_percentage == (None):
                continue
            pl_FT_percentage = row.find_all('td', class_ = 'right')[28]
            if pl_FT_percentage == (None):
                continue
            pl_total_rebounds = row.find_all('td', class_ = 'right')[30]
            if pl_total_rebounds == (None):
                continue
            pl_assist = row.find_all('td', class_ = 'right')[31]
            if pl_assist == (None):
                continue
            pl_steals = row.find_all('td', class_ = 'right')[32]
            if pl_steals == (None):
                continue
            pl_turnovers = row.find_all('td', class_ = 'right')[33]
            if pl_total_rebounds == (None):
                continue
            print(n,
                  pl_team.text,
                  pl_wins.text,
                  pl_loses.text,
                  pl_total_points.text,
                  pl_total_points_againest.text,
                  pl_FG_percentage.text,
                  pl_3_percentage.text,
                  pl_FT_percentage.text,
                  pl_total_rebounds.text,
                  pl_assist.text,
                  pl_steals.text,
                  pl_turnovers.text,
                  )
            data = {'Year': n,
                    'Team': pl_team.text,
                    'Wins': pl_wins.text}
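The print call produces more than 1,000 lines of output.
But when I print data, I get only one row. I know this is not a DataFrame, but I can't get a DataFrame to work either: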
{'Year': 2020, 'Team': 'Youngstown State', 'Wins': '18'}
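You've made a great effort for someone new to Python. One useful thing to know is that DataFrame is not a native data structure in Python, so Python has no idea what a DataFrame is and you cannot create one without an additional library. You will therefore need to download and install the pandas library, and it may help to start with its documentation to see how a DataFrame is initialized from data.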
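As a minimal sketch of that initialization, assuming pandas is already installed (for example with pip install pandas), a DataFrame can be built from a dictionary that maps column names to equal-length lists. The team names and win counts below are made-up placeholders, not scraped data.

```python
import pandas as pd

# Placeholder values for illustration only; not real scraped results.
teams = ['Team A', 'Team B', 'Team C']
wins = [20, 15, 18]

# Each dictionary key becomes a column; each list supplies that column's values.
df_example = pd.DataFrame({'Team': teams, 'Wins': wins})
print(df_example.shape)
```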
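In your code, the data variable you create is a dictionary. Keep in mind that a variable assigned inside a loop is overwritten on every iteration, so once the loop finishes only the last value remains; that is why printing data shows a single row. You need to store the information from each iteration somewhere. For this I suggest lists: initialize them once, outside the loop, and append the new data to them on each iteration.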
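The difference between overwriting and accumulating can be seen in a small sketch. The seasons list and the variable names here are hypothetical stand-ins for the scraped rows.

```python
# Hypothetical stand-in for the rows produced by the scraping loop.
seasons = [(2018, 'Team A'), (2019, 'Team B'), (2020, 'Team C')]

collected = []  # initialized once, outside the loop
for year, team in seasons:
    # Rebinding `latest` discards whatever it held on the previous iteration.
    latest = {'Year': year, 'Team': team}
    # Appending keeps every iteration's value.
    collected.append(latest)

print(latest)          # only the last iteration survives
print(len(collected))  # one entry per iteration
```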
You can then pass a dictionary of these lists to the pd.DataFrame() constructor to build the DataFrame. Here is how I chose to do it:
import pandas as pd
import requests
from bs4 import BeautifulSoup

years = range(2009, 2021)
url2 = 'https://www.sports-reference.com/cbb/seasons/'
url3 = '-school-stats.html'

## initialize all of the lists but make sure they don't refer to the same empty list
pl_team_list, pl_wins_list, pl_loses_list, pl_total_points_list, pl_total_points_againest_list, \
pl_FG_percentage_list, pl_3_percentage_list, pl_FT_percentage_list, pl_total_rebounds_list, \
pl_assist_list, pl_steals_list, pl_turnovers_list = ([] for i in range(12))

for year in years:
    ## build the URL for this season (avoid shadowing the built-in all())
    url = url2 + str(year) + url3
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    league_table = soup.find('table', class_='per_match_toggle sortable stats_table')
    for team in league_table.find_all('tbody'):
        rows = team.find_all('tr')
        for row in rows:
            ## skip rows that have no team cell
            pl_team = row.find('td', class_='left')
            if pl_team is None:
                continue
            ## look up the right-aligned cells once; indexing a short row would
            ## raise IndexError rather than return None, so check the length
            right_cells = row.find_all('td', class_='right')
            if len(right_cells) < 34:
                continue
            pl_wins = right_cells[1]
            pl_loses = right_cells[2]
            pl_total_points = right_cells[16]
            pl_total_points_againest = right_cells[17]
            pl_FG_percentage = right_cells[22]
            pl_3_percentage = right_cells[25]
            pl_FT_percentage = right_cells[28]
            pl_total_rebounds = right_cells[30]
            pl_assist = right_cells[31]
            pl_steals = right_cells[32]
            pl_turnovers = right_cells[33]

            ## append the text to each list
            pl_team_list.append(pl_team.text)
            pl_wins_list.append(pl_wins.text)
            pl_loses_list.append(pl_loses.text)
            pl_total_points_list.append(pl_total_points.text)
            pl_total_points_againest_list.append(pl_total_points_againest.text)
            pl_FG_percentage_list.append(pl_FG_percentage.text)
            pl_3_percentage_list.append(pl_3_percentage.text)
            pl_FT_percentage_list.append(pl_FT_percentage.text)
            pl_total_rebounds_list.append(pl_total_rebounds.text)
            pl_assist_list.append(pl_assist.text)
            pl_steals_list.append(pl_steals.text)
            pl_turnovers_list.append(pl_turnovers.text)

## initialize your DataFrame by passing a dictionary of your lists of data
df = pd.DataFrame({
    'Team': pl_team_list,
    'Wins': pl_wins_list,
    'Loses': pl_loses_list,
    'Total Points': pl_total_points_list,
    'Total Points Against': pl_total_points_againest_list,
    'FG Percentage': pl_FG_percentage_list,
    '3 Percentage': pl_3_percentage_list,
    'FT Percentage': pl_FT_percentage_list,
    'Total Rebounds': pl_total_rebounds_list,
    'Assist': pl_assist_list,
    'Steals': pl_steals_list,
    'Turnovers': pl_turnovers_list
})
After this, df holds one row for each team row that was scraped, with one column per list.
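A variation on the same idea, offered only as a sketch and not as the approach used in this answer, is to collect one dictionary per table row in a single list and pass that list to pd.DataFrame(). This also makes it easy to keep the season year that the question's data dictionary included. The rows data below is made up to stand in for scraped cell text, and the Losses column name is an assumption for the example. Because .text yields strings, pd.to_numeric is used to convert the count columns.

```python
import pandas as pd

# Made-up cell text standing in for values read with `.text` while scraping.
rows = [
    (2019, 'Team A', '20', '12'),
    (2019, 'Team B', '15', '17'),
    (2020, 'Team A', '18', '13'),
]

records = []  # a single list, initialized outside the loop
for year, team, wins, losses in rows:
    records.append({'Year': year, 'Team': team, 'Wins': wins, 'Losses': losses})

# A list of dictionaries becomes one DataFrame row per dictionary.
df_records = pd.DataFrame(records)

# Scraped text arrives as strings; convert the count columns to numbers.
for column in ['Wins', 'Losses']:
    df_records[column] = pd.to_numeric(df_records[column])

print(df_records)
```

With a single list of records there is only one container to keep in sync, at the cost of building a small dictionary for every row.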