How do I scrape and index multiple tables from a single page in Python?
I'm trying to match Chicago's area numbers to community areas using a Wikipedia page.

I know how to do this one table at a time, but I'm sure a loop would make it easier. However, the tables don't include the side names, so I may need to match them more manually with a join or a dictionary.

My current code works, but it puts all the tables into a single table, so I can't tell the "sides" apart.

Main task: I want to scrape all the tables with an additional index column that is unique to each table (the side name would be best). Is this possible with pandas?

A minor question: there are 9 sides, but when I use the (0:8) slice the last table is missing, and I don't know why. Is there a way to do this through ...
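The nice thing about read_html() is that it is very handy when you need to parse <table> tags, but it cannot pick up anything outside the <table> tags. Here the side names sit in the headings above each table, so you need to use BeautifulSoup to be more specific about how you get the data.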
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

tables = soup.find_all('table')
records = []  # one list per output row; the DataFrame is built once at the end

for table in tables:
    # The nearest preceding <h3> heading names the "side" this table belongs to
    heading = table.find_previous('h3')
    if heading is None:
        continue
    main_area = heading.text.split('[')[0].strip()

    # Some tables carry a <caption> with a more detailed sub-area name
    caption = table.find('caption')
    sub_area = caption.text.strip() if caption else 'N/A'

    for row in table.find_all('tr'):
        data = row.find_all('td')
        if len(data) < 3:
            continue  # header rows and rows without the expected three cells
        number = data[0].text.strip()
        com_area = data[1].text.strip()
        # One output row per neighborhood; an empty string if none are listed
        n_list = [each.text.strip() for each in data[2].find_all('li')] or ['']
        for each in n_list:
            records.append([main_area, sub_area, number, com_area, each])

# DataFrame.append was removed in pandas 2.0, so the frame is created in one step
results_df = pd.DataFrame(records, columns=['Community area by side', 'Sub community area by side', 'Number', 'Community area', 'Neighborhoods'])
Output:
print(results_df.to_string())
    Community area by side  Sub community area by side  Number  Community area   Neighborhoods
0   Central                 N/A                         08      Near North Side  Cabrini–Green
1   Central                 N/A                         08      Near North Side  The Gold Coast
2   Central                 N/A                         08      Near North Side  Goose Island
3   Central                 N/A                         08      Near North Side  Magnificent Mile
4   Central                 N/A                         08      Near North Side  Old Town
5   Central                 N/A                         08      Near North Side  River North
6   Central                 N/A                         08      Near North Side  River West
7   Central                 N/A                         08      Near North Side  Streeterville
8   Central                 N/A                         32      Loop             Loop
9   Central                 N/A                         32      Loop             New Eastside
10  Central                 N/A                         32      Loop             South Loop
11  Central                 N/A                         32      Loop             West Loop Gate
12  Central                 N/A                         33      Near South Side  Dearborn Park
13  Central                 N/A                         33      Near South Side  Printer's Row
14  Central                 N/A                         33      Near South Side  South Loop
15  Central                 N/A                         33      Near South Side  Prairie Avenue Historic District
16  North Side              North Side                  05      North Center     Horner Park
17  North Side              North Side                  05      North Center     Roscoe Village
18  North Side              North Side                  06      Lake View        Boystown
19  North Side              North Side                  06      Lake View        Lake View East
20  North Side              North Side                  06      Lake View        Graceland West
21  North Side              North Side                  06      Lake View        South East Ravenswood
22  North Side              North Side                  06      Lake View        Wrigleyville
23  North Side              North Side                  07      Lincoln Park     Old Town Triangle
24  North Side              North Side                  07      Lincoln Park     Park West
25  North Side              North Side                  07      Lincoln Park     Ranch Triangle
26  North Side              North Side                  07      Lincoln Park     Sheffield Neighbors
27  North Side              North Side                  07      Lincoln Park     Wrightwood Neighbors
28  North Side              North Side                  21      Avondale         Belmont Gardens
29  North Side              North Side                  21      Avondale         Chicago's Polish Village
30  North Side              North Side                  21      Avondale         Kosciuszko Park
31  North Side              North Side                  22      Logan Square     Belmont Gardens
32  North Side              North Side                  22      Logan Square     Bucktown
33  North Side              North Side                  22      Logan Square     Kosciuszko Park
34  North Side              North Side                  22      Logan Square     Palmer Square
35  North Side              Far North side              01      Rogers Park      East Rogers Park
36  North Side              Far North side              02      West Ridge       Arcadia Terrace
37  North Side              Far North side              02      West Ridge       Peterson Park
38  North Side              Far North side              02      West Ridge       West Rogers Park
39  North Side              Far North side              03      Uptown           Buena Park
40  North Side              Far North side              03      Uptown           Argyle Street
41  North Side              Far North side              03      Uptown           Margate Park
42  North Side              Far North side              03      Uptown           Sheridan Park
43  North Side              Far North side              04      Lincoln Square   Ravenswood
44  North Side              Far North side              04      Lincoln Square   Ravenswood Gardens
...
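
The output keeps the heading-level side and the caption-level sub-area in two separate columns. If a single, more detailed side column is preferred, one possible approach is to take the caption where there is one and fall back to the heading where the caption is 'N/A'. This is only a minimal sketch: the three-row sample is typed in by hand to mirror the shape of the output, and the column name `Side` is a made-up name, not something from the answer.

```python
import pandas as pd

# Hypothetical three-row sample in the same shape as the scraped output
sample = pd.DataFrame(
    [
        ['Central', 'N/A', '08', 'Near North Side', 'Old Town'],
        ['North Side', 'North Side', '05', 'North Center', 'Horner Park'],
        ['North Side', 'Far North side', '01', 'Rogers Park', 'East Rogers Park'],
    ],
    columns=['Community area by side', 'Sub community area by side',
             'Number', 'Community area', 'Neighborhoods'],
)

sub = sample['Sub community area by side']
# Keep the caption where one exists; otherwise use the heading-level side.
# Series.where keeps values where the condition is True and takes the
# replacement where it is False.
sample['Side'] = sub.where(sub != 'N/A', sample['Community area by side'])

print(sample[['Side', 'Number', 'Community area']].to_string(index=False))
```

The same two lines should work on `results_df` directly, assuming the column names are unchanged.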
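So what do you want the output to look like? Also, read_html() already returns the tables as a list. You don't need to iterate over each table and append it to df_list. In addition, when you slice by index [0:8], the last index value is not included. So you need [0:9], which will include the table at index position 8.

@chitown88 The desired output is a dataframe that includes the number, the community area, and the table number or some Id, so that I can assign them to a "side": for example, the first table would be "Central", the second "North Side", and so on.

Hold on. I have a solution.

I edited the question. I was using "community area" instead of "side", which was definitely misleading.

Thanks! As a Python newbie I'll have to study the code for a while, but this is exactly what I needed. One last question: is it possible to replace the main community area with the more detailed one (the side)? For example, splitting North Side into North Side, Far North Side, Northwest Side, and so on.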
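
The two points from the comments, that a slice excludes its stop index and that each table can carry an Id before the tables are combined, can be sketched without touching the network. The list below is only a stand-in for what `pd.read_html(url)` returns; its contents, the column name `table_id`, and the `side_names` mapping are assumptions for illustration. On the real page the number and order of tables may differ.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html(url): nine small DataFrames.
# The contents are made up; only the list structure matters here.
tables = [pd.DataFrame({'Number': [f'{i:02d}']}) for i in range(9)]

print(len(tables[0:8]))  # 8: the stop index is excluded, so table 8 is dropped
print(len(tables[0:9]))  # 9: all nine tables

# Tag each table with its position in the list, then combine them
tagged = [df.assign(table_id=i) for i, df in enumerate(tables)]
combined = pd.concat(tagged, ignore_index=True)

# Hypothetical mapping from table position to side name; only the first two
# names are taken from the comment above, the rest are left unmapped (NaN)
side_names = {0: 'Central', 1: 'North Side'}
combined['Side'] = combined['table_id'].map(side_names)

print(combined.head(3))
```

Slicing with `tables[:]`, or simply iterating over `tables`, avoids hard-coding the count at all.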