Python: pulling a table into a DataFrame

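I'm trying to figure out how to get this table into a DataFrame, but I can't seem to work out how. So far I've been combining some things I learned in class with an answer posted on this forum, but I still can't get it to work. Could anyone help and explain what they did? My code is below: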

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", attrs={"class":"sortable stats_table now_sortable"})
table_rows = table.find_all('tr')

l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
#test columns
df = pd.DataFrame(l, columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"])
print(df)

You are doing a few things wrong here, including what @Ferris mentioned. This should get you started:

import requests
import pandas as pd  # import pandas under the alias pd, since the code below calls pd.DataFrame
from bs4 import BeautifulSoup

page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page.text, "html.parser") # use page.text
# table = soup.find("table", attrs={"class":"sortable stats_table"})
table = soup.find("table", attrs={"id":"schedule"}) #use the id if available; couldn't get class to work when space is in class name
table_rows = table.find_all('tr')

# this works below as you have it but it doesn't read into the dataframe correctly
l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
#test columns
# read without columns to see what you have
df = pd.DataFrame(l)
# df = pd.DataFrame(l, columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"])
print(df)
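
The lookup by class versus by id can be tried offline with a small sketch. The HTML below is made up for illustration and is not the real page markup. It assumes the `now_sortable` class is missing from the raw HTML that `requests` receives, which would explain why the lookup in the question returns nothing. The sketch also collects both `th` and `td` cells so that the header row is not lost.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML for illustration only; not the real page markup.
html = """
<table id="schedule" class="sortable stats_table">
  <thead><tr><th>G</th><th>Date</th><th>Opponent</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Sat, Nov 28, 2020</td><td>Coppin State</td></tr>
    <tr><th>2</th><td>Tue, Dec 1, 2020</td><td>Michigan State</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string has to equal the class attribute exactly,
# so the extra "now_sortable" makes this lookup return None.
by_full_class = soup.find("table", attrs={"class": "sortable stats_table now_sortable"})

# A single class name matches any table that carries that class.
by_one_class = soup.find("table", class_="stats_table")

# The id is the least ambiguous lookup.
by_id = soup.find("table", id="schedule")

rows = []
for tr in by_id.find_all("tr"):
    # Collect header cells (th) and data cells (td) in document order.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# The first row holds the column names; the remaining rows are data.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

On the real page the header may span more than one row, so the `rows[0]` / `rows[1:]` split is an assumption to check against what `print(pd.DataFrame(rows))` shows.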

There are many ways to do the same thing. This may not be the best way, but it gets the job done:

import requests
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page.content, "html.parser")

table_header = soup.find_all("thead")[1]
table_header_rows = table_header.find_all('tr')
table_header_text = []
for tr in table_header_rows:
    th = tr.find_all('th')
    row = [tr.text for tr in th]
    table_header_text.append(row)

table_body = soup.find_all("tbody")[1]
table_body_rows = table_body.find_all('tr')
table_body_text = []
for tr in table_body_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
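    table_body_text.append(row)

# [1:] skips the first header cell, because the body rows above collect only <td> cells
df = pd.DataFrame(table_body_text, columns=table_header_text[0][1:])
print(df)

My solution: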

import pandas as pd

df = pd.read_html("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")[1]

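# Build the new column names 'A' through 'O':
# ord('A') and ord('O') give the character codes, and chr(x) turns each code back into a letter
new_columns = [chr(x) for x in range(ord('A'), ord('O')+1)]
# Map each existing column name to its new single-letter name
columns = dict(zip(df.columns, new_columns))
df.rename(columns=columns, inplace=True)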
print(df)

Use page.text to get the web content:
df_list = pd.read_html(page.text)
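
A minimal offline sketch of that suggestion follows. The HTML string is made up and stands in for `page.text`. Recent pandas releases warn when `read_html` is handed a raw HTML string, so the text is wrapped in `StringIO` here. `read_html` also needs an HTML parser such as `lxml` (or `bs4` together with `html5lib`) to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for page.text; not the real page markup.
html = """
<table id="schedule">
  <thead><tr><th>G</th><th>Date</th><th>Opponent</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Sat, Nov 28, 2020</td><td>Coppin State</td></tr>
    <tr><th>2</th><td>Tue, Dec 1, 2020</td><td>Michigan State</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
df_list = pd.read_html(StringIO(html))
df = df_list[0]
print(df)
```

With the real page, the same call would take `StringIO(page.text)`, and passing `attrs={"id": "schedule"}` to `read_html` is one way to pick the table by id instead of by position in the list.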
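I think this answer could be better if it were a bit more detailed. In its current state it is rather cryptic. For example, what does the line [chr(x) for x in range(ord('A'), ord('O')+1)] do?
I added a bit in a comment.
Ahh ok, thanks for explaining each line, that helps me understand it better.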
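
For readers with the same question, the list comprehension can be unpacked step by step. This is a small standalone sketch; the `old_columns` names are hypothetical placeholders for whatever `df.columns` holds.

```python
# ord() gives the integer code of a character, chr() does the reverse.
start, stop = ord("A"), ord("O")
print(start, stop)  # 65 79

# range() excludes its end value, hence the + 1 to include 'O'.
new_columns = [chr(code) for code in range(start, stop + 1)]
print(new_columns)

# Hypothetical original column names, one per new letter.
old_columns = [f"col_{i}" for i in range(len(new_columns))]

# zip() pairs old and new names; dict() turns the pairs into the
# mapping that DataFrame.rename(columns=...) expects.
mapping = dict(zip(old_columns, new_columns))
print(mapping["col_0"], mapping["col_14"])  # A O
```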
     A                  B       C    D    E                     F        G    H     I     J   K    L    M    N                                     O
0    1  Sat, Nov 28, 2020   2:00p  REG  NaN          Coppin State     MEAC    W  81.0  71.0 NaN  1.0  0.0  W 1                Cameron Indoor Stadium
1    2   Tue, Dec 1, 2020   7:30p  REG  NaN    Michigan State (8)  Big Ten    L  69.0  75.0 NaN  1.0  1.0  L 1                Cameron Indoor Stadium
2    3   Fri, Dec 4, 2020   7:00p  REG  NaN            Bellarmine    A-Sun    W  76.0  54.0 NaN  2.0  1.0  W 1                Cameron Indoor Stadium
3    4   Tue, Dec 8, 2020   9:30p  REG  NaN          Illinois (6)  Big Ten    L  68.0  83.0 NaN  2.0  2.0  L 1                Cameron Indoor Stadium
4    5  Wed, Dec 16, 2020   9:00p  REG    @            Notre Dame      ACC    W  75.0  65.0 NaN  3.0  2.0  W 1  Purcell Pavilion at the Joyce Center
5    6   Wed, Jan 6, 2021   8:30p  REG  NaN        Boston College      ACC    W  83.0  82.0 NaN  4.0  2.0  W 2                Cameron Indoor Stadium
6    7   Sat, Jan 9, 2021  12:00p  REG  NaN           Wake Forest      ACC    W  79.0  68.0 NaN  5.0  2.0  W 3                Cameron Indoor Stadium
7    8  Tue, Jan 12, 2021   7:00p  REG    @    Virginia Tech (20)      ACC    L  67.0  74.0 NaN  5.0  3.0  L 1                      Cassell Coliseum
8    9  Tue, Jan 19, 2021   9:00p  REG    @            Pittsburgh      ACC    L  73.0  79.0 NaN  5.0  4.0  L 2                Petersen Events Center
9   10  Sat, Jan 23, 2021   4:00p  REG    @            Louisville      ACC    L  65.0  70.0 NaN  5.0  5.0  L 3                       KFC Yum! Center
10  11  Tue, Jan 26, 2021   9:00p  REG  NaN          Georgia Tech      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
11  12  Sat, Jan 30, 2021  12:00p  REG  NaN          Clemson (20)      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
12  13   Mon, Feb 1, 2021   7:00p  REG    @            Miami (FL)      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
13  14   Sat, Feb 6, 2021   6:00p  REG  NaN        North Carolina      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
14  15   Tue, Feb 9, 2021   4:00p  REG  NaN            Notre Dame      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
15  16  Sat, Feb 13, 2021   4:00p  REG    @  North Carolina State      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
16  17  Wed, Feb 17, 2021   8:30p  REG    @           Wake Forest      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
17  18  Sat, Feb 20, 2021     NaN  REG  NaN         Virginia (13)      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
18  19  Mon, Feb 22, 2021   7:00p  REG  NaN              Syracuse      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
19  20  Sat, Feb 27, 2021   6:00p  REG  NaN            Louisville      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
20  21   Tue, Mar 2, 2021   7:00p  REG    @          Georgia Tech      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
21  22   Sat, Mar 6, 2021   6:00p  REG    @        North Carolina      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN