Turning BeautifulSoup data into a DataFrame in Python


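I'm new to Python. I'm scraping data from the web and the scraping itself works, but I'm having trouble getting the data into a DataFrame. It seems I can only get one row of data into it.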

import requests
from bs4 import BeautifulSoup

n=range(2009,2021)

url2 ='https://www.sports-reference.com/cbb/seasons/'
url3 ='-school-stats.html'

for n in n:
    all = url2+str(n)+url3
    r = requests.get(all)
    soup = BeautifulSoup(r.text, 'html.parser')
    league_table = soup.find('table', class_ = 'per_match_toggle sortable stats_table')
    for team in league_table.find_all('tbody'):
        rows = team.find_all('tr')
        for row in rows:
            pl_team = row.find('td', class_ = 'left')
            if pl_team == (None):
                continue
            pl_wins = row.find_all('td', class_ = 'right')[1]
            if pl_wins == (None):
                continue
            pl_loses = row.find_all('td', class_ = 'right')[2]
            if pl_loses == (None):
                continue
            pl_total_points = row.find_all('td', class_ = 'right')[16]
            if pl_total_points == (None):
                continue
            pl_total_points_againest = row.find_all('td', class_ = 'right')[17]
            if pl_total_points_againest == (None):
                continue
            pl_FG_percentage = row.find_all('td', class_ = 'right')[22]
            if pl_FG_percentage == (None):
                continue
            pl_3_percentage = row.find_all('td', class_ = 'right')[25]
            if pl_3_percentage == (None):
                continue
            pl_FT_percentage = row.find_all('td', class_ = 'right')[28]
            if pl_FT_percentage == (None):
                continue
            pl_total_rebounds = row.find_all('td', class_ = 'right')[30]
            if pl_total_rebounds == (None):
                continue
            pl_assist = row.find_all('td', class_ = 'right')[31]
            if pl_assist == (None):
                continue
            pl_steals = row.find_all('td', class_ = 'right')[32]
            if pl_steals == (None):
                continue
            pl_turnovers = row.find_all('td', class_ = 'right')[33]
            if pl_turnovers == (None):
                continue
            print(n,
                  pl_team.text, 
                  pl_wins.text, 
                  pl_loses.text, 
                  pl_total_points.text, 
                  pl_total_points_againest.text, 
                  pl_FG_percentage.text,
                  pl_3_percentage.text,
                  pl_FT_percentage.text, 
                  pl_total_rebounds.text,
                  pl_assist.text,
                  pl_steals.text,
                  pl_turnovers.text,
                 )
            data = {'Year': n,
                   'Team': pl_team.text,
                   'Wins': pl_wins.text}
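
The print call outputs more than 1,000 rows.

But when I print `data`, I get only one row. I know this is not a DataFrame, but I could not get a DataFrame to work either: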

{'Year': 2020, 'Team': 'Youngstown State', 'Wins': '18'}

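You've made a great effort for someone new to Python. One useful thing to know is that `DataFrame` is not a built-in Python data structure, so Python has no notion of a DataFrame and you cannot create one without an additional library. You therefore need to install the `pandas` library, and reading its documentation is a good way to learn how to initialize a DataFrame from your data.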
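As a minimal sketch of what initializing a DataFrame looks like once pandas is installed: the constructor accepts either a dictionary of equal-length columns or a list of per-row dictionaries. The team names and win counts here are made-up placeholders, not scraped values.

```python
import pandas as pd

# Hypothetical sample values, only to show two common constructor inputs.
columns = {
    "Team": ["Team A", "Team B"],
    "Wins": [18, 21],
}
df_from_columns = pd.DataFrame(columns)   # dict of column name -> list of values

rows = [
    {"Team": "Team A", "Wins": 18},
    {"Team": "Team B", "Wins": 21},
]
df_from_rows = pd.DataFrame(rows)         # list of one dict per row

print(df_from_columns)
print(df_from_rows)
```

Both forms describe the same table. Which one is more convenient depends on whether the data arrives column by column or row by row.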

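In your code, the `data` variable you create is a dictionary. Also keep in mind that a variable assigned inside a loop is overwritten on every iteration, so after the loop `data` holds only the last row. You need to store the values somewhere as you go. I suggest using lists: initialize them before the loop, and on each iteration append the new values to them.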
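To make the difference concrete, here is a small sketch that contrasts overwriting a variable with appending to a list. The `scraped` tuples are a hypothetical stand-in for the rows the scraper finds, and the names `records` and `scraped` are not from the original code.

```python
import pandas as pd

# Hypothetical stand-in for the scraped rows: (year, team, wins).
scraped = [(2019, "Team A", "20"), (2020, "Team B", "18"), (2020, "Team C", "25")]

# Overwriting: `data` is rebound on every pass, so only the last row survives.
for year, team, wins in scraped:
    data = {"Year": year, "Team": team, "Wins": wins}

# Accumulating: the list is created once, before the loop, and grows each pass.
records = []
for year, team, wins in scraped:
    records.append({"Year": year, "Team": team, "Wins": wins})

df = pd.DataFrame(records)
print(data)      # a single dictionary
print(len(df))   # one row per scraped record
```

This is the same effect seen in the question: printing inside the loop shows every row, while `data` after the loop holds just one.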

You can then pass those lists to `pd.DataFrame()`, as a dictionary keyed by column name, to construct the DataFrame. Here is how I chose to do it:

import pandas as pd
import requests
from bs4 import BeautifulSoup

n=range(2009,2021)

url2 ='https://www.sports-reference.com/cbb/seasons/'
url3 ='-school-stats.html'

## initialize all of the lists but make sure they don't refer to the same empty list
pl_team_list, pl_wins_list, pl_loses_list, pl_total_points_list, pl_total_points_againest_list, \
pl_FG_percentage_list, pl_3_percentage_list, pl_FT_percentage_list, pl_total_rebounds_list, \
pl_assist_list, pl_steals_list, pl_turnovers_list = ([] for i in range(12))

for n in n:
    all = url2+str(n)+url3
    r = requests.get(all)
    soup = BeautifulSoup(r.text, 'html.parser')
    league_table = soup.find('table', class_ = 'per_match_toggle sortable stats_table')
    for team in league_table.find_all('tbody'):
        rows = team.find_all('tr')
        for row in rows:
            pl_team = row.find('td', class_ = 'left')
            if pl_team == (None):
                continue
            pl_wins = row.find_all('td', class_ = 'right')[1]
            if pl_wins == (None):
                continue
            pl_loses = row.find_all('td', class_ = 'right')[2]
            if pl_loses == (None):
                continue
            pl_total_points = row.find_all('td', class_ = 'right')[16]
            if pl_total_points == (None):
                continue
            pl_total_points_againest = row.find_all('td', class_ = 'right')[17]
            if pl_total_points_againest == (None):
                continue
            pl_FG_percentage = row.find_all('td', class_ = 'right')[22]
            if pl_FG_percentage == (None):
                continue
            pl_3_percentage = row.find_all('td', class_ = 'right')[25]
            if pl_3_percentage == (None):
                continue
            pl_FT_percentage = row.find_all('td', class_ = 'right')[28]
            if pl_FT_percentage == (None):
                continue
            pl_total_rebounds = row.find_all('td', class_ = 'right')[30]
            if pl_total_rebounds == (None):
                continue
            pl_assist = row.find_all('td', class_ = 'right')[31]
            if pl_assist == (None):
                continue
            pl_steals = row.find_all('td', class_ = 'right')[32]
            if pl_steals == (None):
                continue
            pl_turnovers = row.find_all('td', class_ = 'right')[33]
            if pl_turnovers == (None):
                continue

            ## append the text to each list
            pl_team_list.append(pl_team.text)
            pl_wins_list.append(pl_wins.text)
            pl_loses_list.append(pl_loses.text)
            pl_total_points_list.append(pl_total_points.text)
            pl_total_points_againest_list.append(pl_total_points_againest.text)
            pl_FG_percentage_list.append(pl_FG_percentage.text)
            pl_3_percentage_list.append(pl_3_percentage.text)
            pl_FT_percentage_list.append(pl_FT_percentage.text)
            pl_total_rebounds_list.append(pl_total_rebounds.text)
            pl_assist_list.append(pl_assist.text)
            pl_steals_list.append(pl_steals.text)
            pl_turnovers_list.append(pl_turnovers.text)

            # print(n,
            #       pl_team.text, 
            #       pl_wins.text, 
            #       pl_loses.text, 
            #       pl_total_points.text, 
            #       pl_total_points_againest.text, 
            #       pl_FG_percentage.text,
            #       pl_3_percentage.text,
            #       pl_FT_percentage.text, 
            #       pl_total_rebounds.text,
            #       pl_assist.text,
            #       pl_steals.text,
            #       pl_turnovers.text,
            #      )

            # data = {'Year': n,
            #        'Team': pl_team.text,
            #        'Wins': pl_wins.text}

## initialize your DataFrame by passing a dictionary of your lists of data
df = pd.DataFrame({
  'Team': pl_team_list,
  'Wins': pl_wins_list,
  'Losses': pl_loses_list,
  'Total Points': pl_total_points_list,
  'Total Points Against': pl_total_points_againest_list,
  'FG Percentage': pl_FG_percentage_list,
  '3 Percentage': pl_3_percentage_list,
  'FT Percentage': pl_FT_percentage_list,
  'Total Rebounds': pl_total_rebounds_list,
  'Assist': pl_assist_list,
  'Steals': pl_steals_list,
  'Turnovers': pl_turnovers_list
  })

`df` then contains one row per team and season, with the columns defined above.
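
The code above does not carry over the Year value from the question, and every scraped value stays a string. One possible variation, sketched here under assumptions, collects one dictionary per row (including the year) and converts the numeric columns afterwards. `SAMPLE_HTML` is a made-up, heavily shortened stand-in for a season page, and `parse_season` is a hypothetical helper; the real pages have many more columns and would still be fetched with `requests.get`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, heavily shortened stand-in for one season page.
SAMPLE_HTML = """
<table class="stats_table">
  <tbody>
    <tr><td class="left">Team A</td><td class="right">30</td>
        <td class="right">20</td><td class="right">10</td></tr>
    <tr class="thead"><th>School</th><th>G</th><th>W</th><th>L</th></tr>
    <tr><td class="left">Team B</td><td class="right">31</td>
        <td class="right">18</td><td class="right">13</td></tr>
  </tbody>
</table>
"""

def parse_season(html, year):
    """Return one dict per team row; rows without a team cell are skipped."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="stats_table")
    records = []
    for row in table.find_all("tr"):
        team = row.find("td", class_="left")
        if team is None:  # header rows repeated inside the table body
            continue
        stats = row.find_all("td", class_="right")  # looked up once per row
        records.append({
            "Year": year,
            "Team": team.text,
            "Wins": stats[1].text,    # same cell positions as in the question
            "Losses": stats[2].text,
        })
    return records

records = []
for year in (2019, 2020):  # a real run would download each season's page here
    records.extend(parse_season(SAMPLE_HTML, year))

df = pd.DataFrame(records)
# Scraped cell text is a string; convert the numeric columns explicitly.
df[["Wins", "Losses"]] = df[["Wins", "Losses"]].apply(pd.to_numeric)
print(df)
```

Keeping each row as one dictionary means a single list to maintain instead of twelve, and adding a column only requires one more key.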

