
Python: how to scrape the second column from a table


I'm trying to extract the data from the second column of a table, but it fails.

Here is my code:

import bs4
import requests

url = "https://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"

data = requests.get(url)
soup = bs4.BeautifulSoup(data.text, 'html.parser')
My_table = soup.find('table', {'class': 'wikitable sortable'})
#print(My_table)
My_row = My_table.find_all('tr')
#print(My_row[1])
for row in My_row:
    data = row.find('td')[1].text   # this is the line that fails
    print(data)
Here is the error:

TypeError: 'int' object is not subscriptable


What is the best way to fix this?

You can use pandas' read_html:

import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom')
print(tables[1][1])
For all 4 columns, use:

print(tables[1]) 

It's a DataFrame, so you can slice it as needed. It returns [1503 rows x 4 columns].
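
Depending on the pandas version, read_html may promote the table's header row to column labels, in which case tables[1][1] raises a KeyError. A minimal sketch that selects the second column by position with iloc sidesteps the labels entirely:

import pandas as pd

# Assumes, as above, that the postcode table is the second table on the page.
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom')

# iloc selects by position, so this returns the second column regardless
# of what pandas chose as the column labels.
print(tables[1].iloc[:, 1])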


This code seems to work:

import bs4
import requests

url = "https://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"

data = requests.get(url)
soup = bs4.BeautifulSoup(data.text, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})
rows = table.find_all('tr')
for i, row in enumerate(rows):
    if i > 0:  # skip the header row
        for j, td in enumerate(row.children):
            if j == 3:  # the second <td>; see the note after this snippet
                print(td.text.strip())
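
The j == 3 index works because html.parser keeps the whitespace text nodes between tags, so row.children yields a newline string before each cell and the second <td> lands at position 3. A sketch that counts only element children makes that intent explicit (same page and table as above):

import bs4
import requests

url = "https://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})

for row in table.find_all('tr')[1:]:  # skip the header row
    # keep only Tag children, dropping the whitespace NavigableStrings
    cells = [c for c in row.children if isinstance(c, bs4.Tag)]
    if len(cells) > 1:
        print(cells[1].text.strip())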


Recent versions of Beautiful Soup (4.7+) use a package called Soup Sieve, which provides the selector support. I personally find CSS selectors easier for this kind of thing; you can check the full list of CSS selector features it supports in the Soup Sieve documentation.

With selectors this problem is actually quite easy to solve. It would be even easier if CSS Level 4 :nth-col were implemented, but without it we can simply target the table and select the second td element in each row:

import bs4
import requests

url = "https://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"

data = requests.get(url)
soup = bs4.BeautifulSoup(data.text, 'html.parser')

for td in soup.select('table.wikitable.sortable tr td:nth-child(2)'):
    print(td.text.strip())
Truncated output:

AB10, AB11, AB12, AB15, AB16, AB21, AB22, AB23, AB24, AB25, AB99non-geo
AB13
AB14
AB30
AB31
AB32
AB33
AB34
AB35
AB36
AB37
AB38
AB39
AB41
AB42
AB43
AB44
AB45
AB51
AB52
AB53
AB54
AB55
AB56
AL01 AL1, AL2, AL3, AL4
AL05 AL5
AL06 AL6, AL7shared
AL07 AL7shared, AL8
AL09 AL9, AL10
B001 B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B18, B19, B20, B21, B23, B24, B25, B26, B27, B28, B29, B30, B31, B32, B33, B34, B35, B36, B37, B38, B40, B42, B43, B44, B45, B46, B47, B48,B99non-geo
B049 B49, B50
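
A note on the selector choice: :nth-child(2) counts every element child of the row, while td:nth-of-type(2) counts only td siblings. The data rows here contain only td cells, so both forms should select the same cells; Soup Sieve supports the nth-of-type variant as well:

# reusing the soup object built in the code above
for td in soup.select('table.wikitable.sortable tr td:nth-of-type(2)'):
    print(td.text.strip())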


Try the code below; it should work and will print the second-column text.

import bs4
import requests

url = "https://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"

data = requests.get(url)
soup = bs4.BeautifulSoup(data.text, 'html.parser')
My_table = soup.find('table', {'class': 'wikitable sortable'})
My_row = My_table.find_all('tr')
for row in My_row:
    # chain find_next twice: first <td> after the row, then the one after it
    data = row.find_next('td').find_next('td')
    print(data.text.strip())
Output:

AB10, AB11, AB12, AB15, AB16, AB21, AB22, AB23, AB24, AB25, AB99non-geo
AB10, AB11, AB12, AB15, AB16, AB21, AB22, AB23, AB24, AB25, AB99non-geo
AB13
AB14
AB30
AB31
AB32
AB33
AB34
AB35
AB36
AB37
AB38
AB39
AB41
AB42
AB43
AB44
AB45
AB51
AB52
AB53
AB54
AB55
AB56
AL01 AL1, AL2, AL3, AL4
AL05 AL5
AL06 AL6, AL7shared
AL07 AL7shared, AL8
AL09 AL9, AL10
B001 B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B11, B12, B13, B14, B15, B16, B17, B18, B19, B20, B21, B23, B24, B25, B26, B27, B28, B29, B30, B31, B32, B33, B34, B35, B36, B37, B38, B40, B42, B43, B44, B45, B46, B47, B48,B99non-geo
B049 B49, B50
B060 B60, B61
B062 B62, B63
B064 B64
B065 B65
B066 B66, B67
B068 B68, B69
B070 B70, B71
B072  B72, B73, B74, B75, B76
B077 B77, B78, B79
B080 B80
B090 B90, B91, B92, B93, B94
B095 B95
B096 B96, B97, B98
BA01 BA1, BA2
BA03 BA3
BA04 BA4
BA05 BA5
BA06 BA6
BA07 BA7
BA08 BA8
BA09 BA9shared
BA09 BA9,non-geo shared[2] BA10
BA11
BA12
BA13
BA14
BA15
BA16
BA20, BA21, BA22
BB01 BB1, BB2, BB6
BB03 BB3
BB04 BB4
BB05 BB5
BB07 BB7
BB08 BB8
BB09 BB9
BB10, BB11, BB12
BB18, BB94non-geo
BD01 BD1, BD2, BD3, BD4, BD5, BD6, BD7, BD8, BD9, BD10, BD11, BD12, BD13, BD14, BD15, BD98,non-geo shared BD99non-geo
BD16, BD97non-geo
BD17, BD18, BD98non-geo shared
BD19
BD20, BD21, BD22
BD23, BD24shared
BD24shared
BF01 BF1non-geo
BH01 BH1, BH2, BH3, BH4, BH5, BH6, BH7, BH8, BH9, BH10, BH11
BH12, BH13, BH14, BH15, BH16, BH17
BH18
BH19
BH20
BH21
BH22
BH23
BH24
BH25
BH31
BL00 BL0,
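
Note that the first data row appears twice at the top of this output: for the header row, which contains only <th> cells, find_next('td') jumps forward into the first data row. A minimal sketch of a fix, assuming every data row has at least two <td> cells, is simply to skip the header:

# reusing My_row from the snippet above
for row in My_row[1:]:  # index 0 is the header <tr>, which has no <td> cells
    data = row.find_next('td').find_next('td')
    print(data.text.strip())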


The problem is this line: data = (row.find('td')[1].text). row.find('td') does not return a list. Q: which Python debugger or IDE are you using? Suggestion: 1) refactor the code to rowdata = row.find('td'), 2) set a breakpoint, and 3) check the type of rowdata.

@paulsm4 I'm using Spyder.

This way we can't get all the data. Do you want just one column? Otherwise print the whole table, or loop over the tables, i.e. for table in tables:

I need the second-column data. With print(tables[1]) we can't get the complete data because of the first column. Is there any other way to get only the second column?

This works with BeautifulSoup 4.7.1.
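
To close the loop on the first comment: row.find('td') returns a single Tag (or None), not a list, so it cannot be indexed, while find_all('td') returns a list. A minimal corrected version of the question's loop, with a guard for the header row, might look like this:

import bs4
import requests

url = "https://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"

data = requests.get(url)
soup = bs4.BeautifulSoup(data.text, 'html.parser')
My_table = soup.find('table', {'class': 'wikitable sortable'})

for row in My_table.find_all('tr'):
    tds = row.find_all('td')  # find_all returns a list, unlike find
    if len(tds) > 1:          # the header row has <th> cells only
        print(tds[1].text.strip())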