Python: extracting items that sit inside an HTML element but are not wrapped in tags of their own

python html web-scraping beautifulsoup

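I scraped a website that lists the postal codes for Lisbon. With BeautifulSoup I can get the postal codes that sit inside one element, selected by its class. However, the postal codes themselves are still mixed with other tags inside that element, and I have tried many ways to extract all of them. Apart from plain string manipulation, I could not make any of them work. I am new to web scraping and HTML, so I apologize if this question is very basic.

Here is my code: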

from bs4 import BeautifulSoup as soup
from requests import get

url='https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
print(response.text)
html_soup = soup(response.text,'lxml')
type(html_soup)
zip_codes=html_soup.find_all('div', {'class' : 'rightc'})

This is a snippet of the result, from which I only want to extract the postal codes:

[<div class="rightc">1000-246<hr/> 1050-138<hr/> 1069-188<hr/> 1070-204<hr/> 1100-069<hr/> 1100-329<hr/> 1100-591<hr/> 1150-144<hr/> 1169-062<hr/> 1170-128<hr/> 1170-395<hr/> 1200-228<hr/> 1200-604<hr/> 1200-862<hr/> 1250-111<hr/> 1269-121<hr/> 1300-217<hr/> 1300-492<hr/> 1350-092<hr/> 1399-014<hr/> 1400-237<hr/> 1500-061<hr/> 1500-360<hr/> 1500-674<hr/> 1600-232<hr/> 1600-643<hr/> 1700-018<hr/> 1700-302<hr/> 1750-113<hr/> 1750-464<hr/> 1800-262<hr/> 1900-115<hr/> 1900-401<hr/> 1950-208<hr/> 1990-162<hr/> 1000-247<hr/> 1050-139<hr/> 1069-190<hr/> 1070-205<hr/> 1100-070<hr/> 1100-330</div>]

You can get the text of the element and split it.

Output:

[u'1000-246', u'1050-138', u'1069-188', u'1070-204',.........]
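The code for this answer did not survive in the text above. What follows is only a minimal sketch of the idea, not the answerer's original code. It assumes the div looks like the snippet in the question; the inline `html` string stands in for the downloaded page, and `html.parser` is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page: a shortened copy of the snippet in the question.
html = '<div class="rightc">1000-246<hr/> 1050-138<hr/> 1069-188<hr/> 1070-204</div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "rightc"})

# .text drops the <hr/> tags and leaves the codes separated by spaces;
# split() with no argument splits on any run of whitespace.
codes = div.text.split()
print(codes)
```

On Python 3 the list elements print without the `u` prefix shown in the output above.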

Use a regular expression to pick out the codes:

from bs4 import BeautifulSoup
import requests
import re

url = 'https://worldpostalcode.com/portugal/lisboa/'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
element = soup.select_one('.codelist .rightc')
codes = re.findall(r"\d{4}-\d{3}",element.text)

for code in codes:
    print(code)
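The pattern itself can be tried without a network request. The sketch below assumes the text has already been pulled out of the div; `sample_text` is made up for illustration. The `\b` word boundaries are an addition that is not part of the answer above, included so that a longer run of digits is not partially matched.

```python
import re

# Hypothetical text: three postal codes plus one longer number that should be ignored.
sample_text = "1000-246 1050-138 phone 21000-2467 1069-188"

# Four digits, a hyphen, three digits, bounded on both sides by a word boundary.
ZIP_RE = re.compile(r"\b\d{4}-\d{3}\b")

codes = ZIP_RE.findall(sample_text)
print(codes)
```

Without the boundaries, the pattern from the answer would also pick `1000-246` out of the middle of `21000-2467`.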

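Your result zip_codes is of type bs4.element.ResultSet, which is a list of bs4.element.Tag objects. The tag you are interested in is therefore the first one found, zip_codes[0]. Use its .text attribute to strip out the tags. You now have one long string of postal codes separated by spaces. Either of the two options below turns it into a list; option one is more Pythonic and faster.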

from bs4 import BeautifulSoup as soup
from requests import get

url = 'https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
html_soup = soup(response.text,'lxml')
zip_codes = html_soup.find_all('div', {'class' : 'rightc'})

# option one
zips = zip_codes[0].text.split(' ')
print(zips[:8])

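# option two (slower): walk the children and keep only the text nodes,
# skipping the <hr/> tags (NavigableString is a subclass of str)
zips = []
for zc in zip_codes[0].children:
    if isinstance(zc, str):
        zips.append(zc.strip())
print(zips[:8])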

Output:

['1000-246', '1050-138', '1069-188', '1070-204', '1100-069', '1100-329', '1100-591', '1150-144']
['1000-246', '1050-138', '1069-188', '1070-204', '1100-069', '1100-329', '1100-591', '1150-144']
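BeautifulSoup also exposes the text nodes of a tag through `stripped_strings`. This is not what the answer above uses, but it should give the same list without manual splitting. Here is a sketch on a shortened inline copy of the snippet from the question, with `html.parser` chosen only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page: a shortened copy of the snippet in the question.
html = '<div class="rightc">1000-246<hr/> 1050-138<hr/> 1069-188</div>'

div = BeautifulSoup(html, "html.parser").find("div", class_="rightc")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are empty after stripping.
zips = list(div.stripped_strings)
print(zips)
```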

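I suggest that, before loading the page response into soup, you replace all the <hr> tags with some delimiter, e.g. $ or #. Once the modified text is loaded into soup, you can select the element by its class and split its text on the delimiter to get the postal codes as a list, which makes the job very simple.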

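from bs4 import BeautifulSoup as soup
from requests import get

url = 'https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
# Replace the <hr> separators BEFORE parsing and keep the result
# (assumes the raw page writes the tag exactly as '<hr>')
html = response.text.replace('<hr>', '#')
html_soup = soup(html, 'lxml')
zip_codes = html_soup.find_all('div', {'class': 'rightc'})
# find_all returns a ResultSet, so take the first match before reading .text
zip_codes = [code.strip() for code in zip_codes[0].text.split('#')]
print(zip_codes[:8])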

Hope this helps! Cheers.

P.S. This answer is open to improvements and comments.
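A related variation, offered only as a sketch and not as part of the answer above: `get_text` accepts a separator, so the delimiter can be inserted between the text nodes after parsing instead of being patched into the raw HTML. The inline `html` string is a shortened stand-in for the page, and `html.parser` is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page: a shortened copy of the snippet in the question.
html = '<div class="rightc">1000-246<hr/> 1050-138<hr/> 1069-188</div>'

div = BeautifulSoup(html, "html.parser").find("div", class_="rightc")

# Join the text nodes with '#', stripping whitespace around each one,
# then split on the same delimiter.
joined = div.get_text("#", strip=True)
codes = joined.split("#")
print(codes)
```

This does not depend on how the page spells the tag (`<hr>`, `<hr/>` or `<hr />`), since the parser has already turned it into an element by the time the text is joined.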
