Python: pulling a table into a DataFrame
Tags: python, pandas, dataframe, web-scraping, beautifulsoup
I'm trying to figure out how to get this table into a DataFrame, but I can't seem to work out how. So far I've been working from a mix of things I learned in class and an answer posted on this forum, but I still can't get it to work. Could someone explain what they did? My code is below:
import requests
import pandas
from bs4 import BeautifulSoup
page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", attrs={"class":"sortable stats_table now_sortable"})
table_rows = table.find_all('tr')
l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
#test columns
df = pd.DataFrame(l, columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"])
print(df)
You're doing a few things wrong here, including what @Ferris mentioned. This should get you started:
import requests
import pandas as pd  # import pandas under the alias pd, because the code below calls pd.DataFrame
from bs4 import BeautifulSoup
page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page.text, "html.parser")  # pass page.text (the HTML string), not the Response object
# table = soup.find("table", attrs={"class":"sortable stats_table"})
# use the id if available; I couldn't get the class lookup to work when the class name contains a space
table = soup.find("table", attrs={"id": "schedule"})
table_rows = table.find_all('tr')
# the loop below works as you have it, but the result doesn't read into the dataframe correctly
l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]
    l.append(row)
# test columns
# build the frame without column names first to see what you have
df = pd.DataFrame(l)
# df = pd.DataFrame(l, columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"])
print(df)
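To make the points in this answer easier to try without hitting the site, here is a minimal sketch on a small inline HTML string. The markup is a hypothetical stand-in for page.text, not the real page. It also shows soup.select_one with a CSS selector as one way to match several classes at once; that call is not part of the answer above. A likely reason the original class lookup failed is that now_sortable is added by JavaScript in the browser and is absent from the downloaded HTML, but that is an assumption.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.text; the real page has many more columns.
html = """
<table id="schedule" class="sortable stats_table">
  <thead><tr><th>G</th><th>Date</th><th>Opponent</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Sat, Nov 28, 2020</td><td>Coppin State</td></tr>
    <tr><th>2</th><td>Tue, Dec 1, 2020</td><td>Michigan State</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Look the table up by id, or by its CSS classes with a selector.
by_id = soup.find("table", attrs={"id": "schedule"})
by_class = soup.select_one("table.sortable.stats_table")

rows = []
for tr in by_id.find_all("tr"):
    # Collect both <th> and <td> cells so the header row is not empty.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# The first row holds the header names; the remaining rows are data.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

Collecting only td cells, as the loop in the answer does, leaves the header row as an empty list. That is one reason the 15 hand-written column names did not line up with the data.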
There are many ways to do the same thing. This may not be the best way, but it gets the job done:
import requests
import pandas as pd
from bs4 import BeautifulSoup
page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page.content, "html.parser")
# collect the header text from the second <thead> on the page
table_header = soup.find_all("thead")[1]
table_header_rows = table_header.find_all('tr')
table_header_text = []
for tr in table_header_rows:
    th = tr.find_all('th')
    row = [cell.text for cell in th]
    table_header_text.append(row)
# collect the cell text from the matching <tbody>
table_body = soup.find_all("tbody")[1]
table_body_rows = table_body.find_all('tr')
table_body_text = []
for tr in table_body_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]
    table_body_text.append(row)
# skip the first header name, because the first cell of each body row is a <th>, not a <td>
df = pd.DataFrame(table_body_text, columns=table_header_text[0][1:])
print(df)
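My solution:
import pandas as pd
# read_html returns a list of DataFrames, one per table found; take the second one
df = pd.read_html("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")[1]
# Generate a list of the new column names: the letters 'A' through 'O'
new_columns = [chr(x) for x in range(ord('A'), ord('O')+1)]
# Map each existing column name to its new letter
columns = dict(zip(df.columns, new_columns))
df.rename(columns=columns, inplace=True)
print(df)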
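For readers unsure what the list comprehension in the answer above does, here is a small sketch that runs it on its own. The old_columns list is a hypothetical stand-in for df.columns, shortened to three names for illustration.

```python
# ord() turns a character into its integer code point, chr() does the reverse.
start = ord('A')   # 65
stop = ord('O')    # 79

# range(65, 80) covers 65..79, so the +1 is needed to include 'O' itself.
new_columns = [chr(x) for x in range(start, stop + 1)]
print(new_columns)

# Hypothetical original column names, standing in for df.columns.
old_columns = ["G", "Date", "Time"]

# zip pairs old and new names position by position and stops at the shorter list.
mapping = dict(zip(old_columns, new_columns))
print(mapping)
```

The resulting dictionary is what gets passed to df.rename(columns=...), which renames each key to its value.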
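Use page.text to get the web content:
df_list = pd.read_html(page.text)
I think this answer could be better if it were a bit more detailed. In its current state it's rather cryptic. For example, what does the line [chr(x) for x in range(ord('A'), ord('O')+1)] do?
I added a bit in a comment.
Ahh ok, thanks for explaining each line, that helps me understand it better.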
    A   B                  C       D    E    F                     G        H    I     J     K    L    M    N    O
0   1   Sat, Nov 28, 2020  2:00p   REG  NaN  Coppin State          MEAC     W    81.0  71.0  NaN  1.0  0.0  W 1  Cameron Indoor Stadium
1   2   Tue, Dec 1, 2020   7:30p   REG  NaN  Michigan State (8)    Big Ten  L    69.0  75.0  NaN  1.0  1.0  L 1  Cameron Indoor Stadium
2   3   Fri, Dec 4, 2020   7:00p   REG  NaN  Bellarmine            A-Sun    W    76.0  54.0  NaN  2.0  1.0  W 1  Cameron Indoor Stadium
3   4   Tue, Dec 8, 2020   9:30p   REG  NaN  Illinois (6)          Big Ten  L    68.0  83.0  NaN  2.0  2.0  L 1  Cameron Indoor Stadium
4   5   Wed, Dec 16, 2020  9:00p   REG  @    Notre Dame            ACC      W    75.0  65.0  NaN  3.0  2.0  W 1  Purcell Pavilion at the Joyce Center
5   6   Wed, Jan 6, 2021   8:30p   REG  NaN  Boston College        ACC      W    83.0  82.0  NaN  4.0  2.0  W 2  Cameron Indoor Stadium
6   7   Sat, Jan 9, 2021   12:00p  REG  NaN  Wake Forest           ACC      W    79.0  68.0  NaN  5.0  2.0  W 3  Cameron Indoor Stadium
7   8   Tue, Jan 12, 2021  7:00p   REG  @    Virginia Tech (20)    ACC      L    67.0  74.0  NaN  5.0  3.0  L 1  Cassell Coliseum
8   9   Tue, Jan 19, 2021  9:00p   REG  @    Pittsburgh            ACC      L    73.0  79.0  NaN  5.0  4.0  L 2  Petersen Events Center
9   10  Sat, Jan 23, 2021  4:00p   REG  @    Louisville            ACC      L    65.0  70.0  NaN  5.0  5.0  L 3  KFC Yum! Center
10  11  Tue, Jan 26, 2021  9:00p   REG  NaN  Georgia Tech          ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
11  12  Sat, Jan 30, 2021  12:00p  REG  NaN  Clemson (20)          ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
12  13  Mon, Feb 1, 2021   7:00p   REG  @    Miami (FL)            ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
13  14  Sat, Feb 6, 2021   6:00p   REG  NaN  North Carolina        ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
14  15  Tue, Feb 9, 2021   4:00p   REG  NaN  Notre Dame            ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
15  16  Sat, Feb 13, 2021  4:00p   REG  @    North Carolina State  ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
16  17  Wed, Feb 17, 2021  8:30p   REG  @    Wake Forest           ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
17  18  Sat, Feb 20, 2021  NaN     REG  NaN  Virginia (13)         ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
18  19  Mon, Feb 22, 2021  7:00p   REG  NaN  Syracuse              ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
19  20  Sat, Feb 27, 2021  6:00p   REG  NaN  Louisville            ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
20  21  Tue, Mar 2, 2021   7:00p   REG  @    Georgia Tech          ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN
21  22  Sat, Mar 6, 2021   6:00p   REG  @    North Carolina        ACC      NaN  NaN   NaN   NaN  NaN  NaN  NaN  NaN