How can I scrape and index multiple tables from one page in Python?

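I am trying to use a Wikipedia page to match Chicago's area numbers with its community areas:

I know how to do this one table at a time, but I am sure a loop would make the job easier.

However, the tables do not include the side names, so I may have to match them more manually with a join or a dictionary.

The code below works, but it lumps all the tables into a single table, so I cannot tell the "sides" apart.

  • Main task: I want to scrape all the tables with an additional index column that is unique to each table (ideally the side name). Is that possible with pandas?

  • A minor question: there are 9 sides, but when I use the (0:8) range the last table is missing, and I don't know why. Is there a way to automate this range with something like len?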


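  • read_html() is handy when all you need is to parse <table> tags, but it cannot capture anything outside the <table> tags. To label each table you therefore need BeautifulSoup, which lets you be more specific about what to pull from the page.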

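    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://en.wikipedia.org/wiki/Community_areas_in_Chicago'
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    columns = ['Community area by side', 'Sub community area by side',
               'Number', 'Community area', 'Neighborhoods']
    records = []  # one list per output row; the DataFrame is built once at the end
    
    for table in soup.find_all('table'):
        # The side name is the nearest <h3> heading that precedes the table.
        heading = table.find_previous('h3')
        if heading is None:
            continue  # no heading before this table, so it cannot be labelled
        main_area = heading.text.split('[')[0].strip()  # drop the "[edit]" suffix
    
        # Some tables carry a <caption> with a more detailed side name.
        caption = table.find('caption')
        sub_area = caption.text.strip() if caption else 'N/A'
    
        for row in table.find_all('tr'):
            data = row.find_all('td')
            if len(data) < 3:
                continue  # header rows and short rows have no usable <td> cells
    
            number = data[0].text.strip()
            com_area = data[1].text.strip()
    
            # One output row per neighborhood; keep the area even if the list is empty.
            n_list = [li.text.strip() for li in data[2].find_all('li')] or ['']
            for neighborhood in n_list:
                records.append([main_area, sub_area, number, com_area, neighborhood])
    
    # DataFrame.append() was removed in pandas 2.0, so build the frame in one step.
    results_df = pd.DataFrame(records, columns=columns)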
    
    Output:

    print(results_df.to_string())
        Community area by side Sub community area by side Number          Community area                     Neighborhoods
    0                  Central                        N/A     08         Near North Side                     Cabrini–Green
    1                  Central                        N/A     08         Near North Side                    The Gold Coast
    2                  Central                        N/A     08         Near North Side                      Goose Island
    3                  Central                        N/A     08         Near North Side                  Magnificent Mile
    4                  Central                        N/A     08         Near North Side                          Old Town
    5                  Central                        N/A     08         Near North Side                       River North
    6                  Central                        N/A     08         Near North Side                        River West
    7                  Central                        N/A     08         Near North Side                     Streeterville
    8                  Central                        N/A     32                    Loop                              Loop
    9                  Central                        N/A     32                    Loop                      New Eastside
    10                 Central                        N/A     32                    Loop                        South Loop
    11                 Central                        N/A     32                    Loop                    West Loop Gate
    12                 Central                        N/A     33         Near South Side                     Dearborn Park
    13                 Central                        N/A     33         Near South Side                     Printer's Row
    14                 Central                        N/A     33         Near South Side                        South Loop
    15                 Central                        N/A     33         Near South Side  Prairie Avenue Historic District
    16              North Side                 North Side     05            North Center                       Horner Park
    17              North Side                 North Side     05            North Center                    Roscoe Village
    18              North Side                 North Side     06               Lake View                          Boystown
    19              North Side                 North Side     06               Lake View                    Lake View East
    20              North Side                 North Side     06               Lake View                    Graceland West
    21              North Side                 North Side     06               Lake View             South East Ravenswood
    22              North Side                 North Side     06               Lake View                      Wrigleyville
    23              North Side                 North Side     07            Lincoln Park                 Old Town Triangle
    24              North Side                 North Side     07            Lincoln Park                         Park West
    25              North Side                 North Side     07            Lincoln Park                    Ranch Triangle
    26              North Side                 North Side     07            Lincoln Park               Sheffield Neighbors
    27              North Side                 North Side     07            Lincoln Park              Wrightwood Neighbors
    28              North Side                 North Side     21                Avondale                   Belmont Gardens
    29              North Side                 North Side     21                Avondale          Chicago's Polish Village
    30              North Side                 North Side     21                Avondale                   Kosciuszko Park
    31              North Side                 North Side     22            Logan Square                   Belmont Gardens
    32              North Side                 North Side     22            Logan Square                          Bucktown
    33              North Side                 North Side     22            Logan Square                   Kosciuszko Park
    34              North Side                 North Side     22            Logan Square                     Palmer Square
    35              North Side             Far North side     01             Rogers Park                  East Rogers Park
    36              North Side             Far North side     02              West Ridge                   Arcadia Terrace
    37              North Side             Far North side     02              West Ridge                     Peterson Park
    38              North Side             Far North side     02              West Ridge                  West Rogers Park
    39              North Side             Far North side     03                  Uptown                        Buena Park
    40              North Side             Far North side     03                  Uptown                     Argyle Street
    41              North Side             Far North side     03                  Uptown                      Margate Park
    42              North Side             Far North side     03                  Uptown                     Sheridan Park
    43              North Side             Far North side     04          Lincoln Square                        Ravenswood
    44              North Side             Far North side     04          Lincoln Square                Ravenswood Gardens 
    ...
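
The heading lookup used in the code above can be tried without a network connection. The following is a minimal sketch, assuming a heavily shortened, made-up HTML snippet (the `html` string is an illustration, not the real Wikipedia markup) that has the same shape: an `<h3>` heading followed by a table, with a `<caption>` on some tables only.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup that mimics the page layout:
# an <h3> heading followed by a table, some tables with a <caption>.
html = """
<h3>Central[edit]</h3>
<table><tr><td>08</td><td>Near North Side</td></tr></table>
<h3>North Side[edit]</h3>
<table><caption>Far North side</caption>
  <tr><td>01</td><td>Rogers Park</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

labels = []
for table in soup.find_all("table"):
    # find_previous() searches backwards through the document from the table.
    heading = table.find_previous("h3")
    side = heading.get_text().split("[")[0].strip()
    # The caption, when present, sits inside the table itself.
    caption = table.find("caption")
    sub_side = caption.get_text().strip() if caption else "N/A"
    labels.append((side, sub_side))

print(labels)
```

Each table ends up with a pair of labels, the heading text and the caption text, which is the same pair the full script stores in its first two columns.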
    

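    So what do you want the output to look like? Also, read_html() already returns the tables as a list, so you do not need to iterate over each table and append it to df_list. In addition, when you slice by index with [0:8], the end index is not included. You need [0:9], which includes the table at index position 8.

    @chitown88 The desired output is a dataframe that includes the number, the community area, and a table number or some Id, so that I can assign the rows to a "side": for example, the first table would be "Central", the second "North Side", and so on.

    Hold on. I have a solution.

    I edited the question. I had written "community area" where I meant "side", which was definitely misleading.

    Thanks! As a Python newbie I will have to study the code for a while, but this is exactly what I needed. One last question: is it possible to replace the main community area with the more detailed side? For example, splitting North Side into North Side, Far North Side, Northwest Side, and so on.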
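
The two points from the first comment (a slice excludes its end index, and read_html() already returns a list) can be sketched as follows. This is only an illustration under stated assumptions: the three small DataFrames stand in for the list that `pd.read_html()` would return, and the names in `sides` are typed in by hand rather than scraped. Passing `keys=` to `pd.concat` is one possible way to record which table each row came from.

```python
import pandas as pd

# Stand-in for the list of DataFrames returned by pd.read_html()
# (hypothetical, shortened data; the real page has more tables and rows).
tables = [
    pd.DataFrame({"Number": ["08", "32"],
                  "Community area": ["Near North Side", "Loop"]}),
    pd.DataFrame({"Number": ["05"], "Community area": ["North Center"]}),
    pd.DataFrame({"Number": ["01"], "Community area": ["Rogers Park"]}),
]

# A slice stops *before* its end index, so [0:2] holds only two of the three tables.
too_short = tables[0:2]

# len() (or no slice at all) covers every table without hard-coding a count.
all_tables = tables[0:len(tables)]

# Hand-written labels, one per table, in page order (assumed names).
sides = ["Central", "North Side", "Far North side"]

# keys= adds an outer index level that records the source table of each row.
combined = pd.concat(all_tables, keys=sides, names=["Side", "Row"])

# Turn that outer level into an ordinary column and renumber the rows.
combined = combined.reset_index(level="Side").reset_index(drop=True)
print(combined)
```

The labels here depend on the tables arriving in the same order as the hand-written list, which is why the BeautifulSoup approach, which reads the name next to each table, is less fragile.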