Creating a table with row labels in Python using Beautiful Soup


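I am trying to scrape a table that has row labels from a website. I can get the actual data out of the table, but I can't figure out how to get the row labels.

This is my code so far: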

import numpy as np
import pandas as pd  
import urllib.request
from bs4 import BeautifulSoup

url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)

html = res.read()

## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

tables = bs.find_all("table")
table = tables[0]

df = pd.DataFrame()

rows = table.find_all("tr")

#extract the first column name (Employment income groups (18))
column_names = []
header_cells = rows[0].find_all("th") 

for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)

#extract the rest of the column names
header_cells = rows[1].find_all("th") 

for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)

#this is an extra label
column_names.remove('Main mode of commuting (10)')

#get the data from the table
data = []
for row in rows[2:]:

    ## create an empty tuple
    dt = ()

    cells = row.find_all("td")

    for cell in cells:
        ## dp stands for "data point"
        font = cell.find("font")

        if font is not None:
            dp = font.text
        else:
            dp = cell.text

        dp = dp.strip()
        dp = dp.replace("\n", " ")

        ## add to tuple
        dt = dt + (dp,)
    data.append(dt)

df = pd.DataFrame(data, columns = column_names)

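Creating the dataframe raises an error, because the code above only extracts the cells that hold data points and skips the first cell of each row, which holds the row label.

That is, there are 11 column names, but each tuple has only 10 values: the row labels (for example, Total - Employment income) are not extracted because they are th cells rather than td cells.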
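The mismatch described above can be reproduced in isolation. The following is a minimal sketch with made-up values and only three column names (not the real census data), showing that pandas rejects rows that are shorter than the list of column names:

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the scraped names and values
column_names = ["Employment income groups (18)", "Total", "Car, truck or van"]
data = [("100", "80"), ("10", "7")]  # row labels are missing, so each tuple is one value short

try:
    pd.DataFrame(data, columns=column_names)
    mismatch_error = None
except ValueError as exc:
    # pandas reports that the number of columns passed does not match the data
    mismatch_error = str(exc)

print(mismatch_error)
```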

How can I get the row labels and put them into the tuples while processing the rest of the data in the table?

Thanks for your help.

In case it is not clear from the code, the table I am trying to scrape is on this page.

Use table.findAll('th', {'headers': 'col-0'}) to find the row labels:

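lab = []
# Row labels are the <th> cells whose "headers" attribute is "col-0"
labels = table.findAll('th', {'headers': 'col-0'})
for label in labels:
    data = label.text.strip()
    # Keep only the text before the "($)Footnote" suffix that some labels carry
    data = data.split("($)Footnote", 1)[0]

    lab.append(data)
    #print(data)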
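As an alternative to collecting the labels separately, the label and the data cells can be read in the same loop. The sketch below assumes a small, made-up table (the `html` string is hypothetical, not the real page) in which every body row starts with a th cell followed by td cells. It relies on find_all accepting a list of tag names:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the census table: each body row starts with a <th> row label
html = """
<table>
  <tr><th id="col-0">Employment income groups (18)</th><th>Total</th><th>Car, truck or van</th></tr>
  <tr><th headers="col-0">Total - Employment income</th><td>100</td><td>80</td></tr>
  <tr><th headers="col-0">Without employment income</th><td>10</td><td>7</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# Column names come from the header row
column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]

data = []
for row in rows[1:]:
    # A list of names matches both the <th> label and the <td> cells, in document order
    cells = row.find_all(["th", "td"])
    data.append(tuple(cell.get_text(strip=True) for cell in cells))

df = pd.DataFrame(data, columns=column_names)
print(df)
```

With the label included, each tuple has as many values as there are column names, so the DataFrame constructor no longer complains. The real page has a two-row header, so the header handling would still need the adjustments shown in the question.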

Edit: using pandas.read_html

import numpy as np
import pandas as pd  
import urllib.request
from bs4 import BeautifulSoup

url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)

html = res.read()

## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

tables = bs.find_all("table")

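# Newer pandas versions deprecate passing literal HTML to read_html, so wrap it in StringIO
from io import StringIO
df = pd.read_html(StringIO(str(tables)))[0]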
#print(df)
columns = ['Employment income groups (18)','Total - Main mode of commuting','Car, truck or van','Driver, alone',
          '2 or more persons shared the ride to work','Driver, with 1 or more passengers',
         'Passenger, 2 or more persons in the vehicle','Sustainable transportation',
         'Public transit','Active transport','Other method']
df.columns = columns
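Once the columns are named, the label column can be turned into the row index so that rows are addressable by label. This is a minimal sketch on a made-up table (the `html` string and its values are hypothetical), assuming an HTML parser that pandas.read_html can use, such as lxml, is installed:

```python
from io import StringIO
import pandas as pd

# Hypothetical miniature of the census table
html = """
<table>
  <tr><th>Employment income groups (18)</th><th>Total</th><th>Car, truck or van</th></tr>
  <tr><th>Total - Employment income</th><td>100</td><td>80</td></tr>
  <tr><th>Without employment income</th><td>10</td><td>7</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]

# Name the columns explicitly, as in the answer above
df.columns = ["Employment income groups (18)", "Total", "Car, truck or van"]

# Use the label column as the index so rows can be looked up by label
labelled = df.set_index("Employment income groups (18)")
print(labelled.loc["Total - Employment income", "Total"])
```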

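Edit 2: Elements could not be accessed through the index because the string used for the Employment income groups (18) column label was not the correct string. I have edited the code again.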

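Thanks! To apply the labels to the rows in the dataframe, is it better to get the label values separately, or as part of the loop that gets the rest of the cell data? Also, can you explain the {'headers': 'col-0'} bit? I understand that it is checking the page source for the words headers and col-0, but I don't understand the syntax.

See Total - Employment income and this html: Employment income groups (18)

@firehawk12 - I have a better solution. I have edited the answer, please take a look.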
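On the {'headers': 'col-0'} syntax asked about above: the second positional argument of find_all is a dictionary of attribute filters, so it selects th tags whose headers attribute is col-0. A minimal sketch on a made-up snippet (the `html` string is hypothetical), showing three spellings that should be equivalent:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the header cell has id="col-0", the row labels have headers="col-0"
html = """
<table>
  <tr><th id="col-0">Employment income groups (18)</th><th id="col-1">Total</th></tr>
  <tr><th headers="col-0">Total - Employment income</th><td headers="col-1">100</td></tr>
  <tr><th headers="col-0">Without employment income</th><td headers="col-1">10</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Dictionary passed positionally, as in the answer
by_dict = soup.find_all("th", {"headers": "col-0"})
# The same dictionary passed by keyword
by_attrs = soup.find_all("th", attrs={"headers": "col-0"})
# The attribute passed as a keyword argument
by_kwarg = soup.find_all("th", headers="col-0")

labels = [th.get_text(strip=True) for th in by_dict]
print(labels)
```

The header cell itself is not matched, because it carries id="col-0" rather than headers="col-0".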