Web scraping: Parsing HTML tables in Python using BeautifulSoup


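I am trying to access the data starting from the second table (i.e., the Specifications tab), but my code only returns the data from the first table. After reading many other posts, I came up with the following, which does not produce the list I want: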

from bs4 import BeautifulSoup
import csv
import requests
html = "http://www.carwale.com/marutisuzuki-cars/baleno/sigma12/"
html_content = requests.get(html).text
soup = BeautifulSoup(html_content, "lxml")
table = soup.find("table")


output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

output_rows
.find() returns only the first matching element/tag. You want .find_all(), which returns a list of all matching elements/tags.
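
The difference can be seen on a small inline document. This is only a minimal sketch: the two-table HTML string below is made up for illustration and is not the real page's markup, and "html.parser" is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-table page; the real site's markup will differ.
html = """
<table><tr><td>Summary</td><td>first table</td></tr></table>
<table><tr><td>Engine</td><td>1197cc</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

first_table = soup.find("table")      # a single Tag (or None if nothing matches)
all_tables = soup.find_all("table")   # a list-like ResultSet with every match

# Skip the first table and collect the cell text of the remaining ones.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for table in all_tables[1:]
    for tr in table.find_all("tr")
]
print(len(all_tables), rows)
```

Slicing the result of .find_all() is what lets you start from the second table instead of the first.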

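In this case, though, may I recommend pandas? pandas' .read_html() parses the page for you (it tries lxml first by default and can fall back to BeautifulSoup under the hood) and looks for the <table> tags. It then returns them as a list of DataFrames, so it is just a matter of selecting the index positions of the tables you want. Looking at the site, the tables you are after are returned at index positions 1-4: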

import pandas as pd

# read_html returns a list with one DataFrame per <table> on the page
dfs = pd.read_html('http://www.carwale.com/marutisuzuki-cars/baleno/sigma12/')

# Stack the tables at index positions 1-4 into a single DataFrame
result = pd.concat(dfs[1:5], sort=False, ignore_index=True)
Output:

print (result)
                         0                                                  1
0                   Engine  1197cc, 4 Cylinders Inline, 4 Valves/Cylinder,...
1              Engine Type                                                VVT
2                Fuel Type                                             Petrol
3      Max Power (bhp@rpm)                                  82 bhp @ 6000 rpm
4      Max Torque (Nm@rpm)                                  115 Nm @ 4000 rpm
5           Mileage (ARAI)                                         21.01 kmpl
6               Drivetrain                                                FWD
7             Transmission                                   Manual - 5 Gears
8        Emission Standard                                               BS 6
9                   Length                                            3995 mm
10                   Width                                            1745 mm
11                  Height                                            1510 mm
12               Wheelbase                                            2520 mm
13        Ground Clearance                                             170 mm
14             Kerb Weight                                             865 kg
15                   Doors                                            5 Doors
16        Seating Capacity                                           5 Person
17      No of Seating Rows                                             2 Rows
18               Bootspace                                         339 litres
19      Fuel Tank Capacity                                          37 litres
20        Suspension Front                                    McPherson Strut
21         Suspension Rear                                       Torsion Beam
22        Front Brake Type                                               Disc
23         Rear Brake Type                                               Drum
24  Minimum Turning Radius                                         4.9 metres
25           Steering Type                          Power assisted (Electric)
26                  Wheels                                         Steel Rims
27             Spare Wheel                                              Steel
28             Front Tyres                                       185 / 65 R15
29              Rear Tyres                                       185 / 65 R15
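
The index-selection idea can also be tried without hitting the live site. The following is a rough sketch under stated assumptions: the three-table HTML string is invented for illustration, the slice 1:3 simply mirrors the "skip table 0" idea, and an HTML parser that pandas can use (lxml, or BeautifulSoup with html5lib) is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with three small tables; not the real site's markup.
html = """
<table><tr><td>Summary</td><td>ignored</td></tr></table>
<table><tr><td>Engine Type</td><td>VVT</td></tr></table>
<table><tr><td>Fuel Type</td><td>Petrol</td></tr></table>
"""

# read_html returns one DataFrame per <table> it finds.
dfs = pd.read_html(StringIO(html))

# Keep tables 1 and 2 (skipping table 0) and stack them with a fresh index.
result = pd.concat(dfs[1:3], ignore_index=True)
print(result)
```

Wrapping the string in StringIO keeps the call working on pandas versions that no longer accept literal HTML strings directly.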

When you are targeting specific tables, you need to select them by their class name. Try the following CSS selector.

Code:

import requests
from bs4 import BeautifulSoup

html = "http://www.carwale.com/marutisuzuki-cars/baleno/sigma12/"
html_content = requests.get(html).text
soup = BeautifulSoup(html_content, "lxml")

# Select tables with class "specs" that do not also have class "features"
tables = soup.select("table.specs:not(.features)")

output_rows = []
for table in tables:
    for table_row in table.find_all('tr'):
        columns = table_row.find_all('td')
        output_row = []
        for column in columns:
            # strip() removes the whitespace surrounding each cell's text
            output_row.append(column.text.strip())
        output_rows.append(output_row)

print(output_rows)
Output:

[['Engine', '1197cc, 4 Cylinders Inline, 4 Valves/Cylinder, DOHC'], ['Engine Type', 'VVT'], ['Fuel Type', 'Petrol'], ['Max Power (bhp@rpm)', '82 bhp @ 6000 rpm'], ['Max Torque (Nm@rpm)', '115 Nm @ 4000 rpm'], ['Mileage (ARAI)', '21.01 kmpl'], ['Drivetrain', 'FWD'], ['Transmission', 'Manual - 5 Gears'], ['Emission Standard', 'BS 6'], ['Length', '3995 mm'], ['Width', '1745 mm'], ['Height', '1510 mm'], ['Wheelbase', '2520 mm'], ['Ground Clearance', '170 mm'], ['Kerb Weight', '865 kg'], ['Doors', '5 Doors'], ['Seating Capacity', '5 Person'], ['No of Seating Rows', '2 Rows'], ['Bootspace', '339 litres'], ['Fuel Tank Capacity', '37 litres'], ['Suspension Front', 'McPherson Strut'], ['Suspension Rear', 'Torsion Beam'], ['Front Brake Type', 'Disc'], ['Rear Brake Type', 'Drum'], ['Minimum Turning Radius', '4.9 metres'], ['Steering Type', 'Power assisted (Electric)'], ['Wheels', 'Steel Rims'], ['Spare Wheel', 'Steel'], ['Front Tyres', '185 / 65 R15'], ['Rear Tyres', '185 / 65 R15']]
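
To see what the :not(.features) part of the selector does in isolation, here is a small sketch on made-up markup. The class names specs and features come from the selector above; the HTML string itself and the "other" class are hypothetical, and "html.parser" is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names from the selector above.
html = """
<table class="specs"><tr><td>Fuel Type</td><td> Petrol </td></tr></table>
<table class="specs features"><tr><td>Air Conditioner</td><td>Yes</td></tr></table>
<table class="other"><tr><td>Unrelated</td><td>n/a</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches <table> elements that have class "specs" but not class "features".
tables = soup.select("table.specs:not(.features)")

output_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for table in tables
    for tr in table.find_all("tr")
]
print(output_rows)
```

Only the first table survives the filter: the second is excluded by :not(.features) and the third never matches table.specs.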