Python: excluding certain image paths when scraping a web page with BeautifulSoup
I created the following Python script to extract image src paths from a specified URL:
from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
url="https://www.example.com/"
session = HTMLSession()
r = session.get(url)
b = requests.get(url)
soup = BeautifulSoup(b.text, "lxml")
images = soup.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
The script works fine, but we use a CDN, so some image paths look like this:
https://i2.wp.com/www.example.com/wp-content/uploads/2020/06/image-name.png?fit=250%2C250&ssl=1
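So I would like to be able to exclude image src paths that start with a certain prefix, such as the CDN prefix above (possibly using a regex).
Is this possible?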
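Thanks
It's as simple as:
url="https://www.example.com/"
# str.startswith() compares literal text, so the prefix must not contain regex syntax such as ".*"
exclude="https://i2.wp.com"
# soup is built as in the question's script
images = soup.find_all('img')
for img in images:
    if img.has_attr('src') and not img["src"].startswith(exclude):
        print(img['src'])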
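You can use a regex, or simply use .startswith() as mentioned. Then, in the for loop, if the src starts with that prefix, continue. This means the code stops there and moves on to the next item in the iteration: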
url="https://www.example.com/"
exclude="https://i2.wp.com"
images = soup.find_all('img')
for img in images:
    if img.has_attr('src'):
        if img['src'].startswith(exclude):
            continue
        print(img['src'])
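For the regex route mentioned in this answer, a minimal sketch might look like the following. The sample src values, the helper name keep_src, and the pattern itself (which assumes the CDN serves images from hosts of the form i<digit>.wp.com) are illustrative assumptions, not part of the original script:

```python
import re

# Assumed pattern: CDN hosts of the form i<digit>.wp.com at the start of the path.
EXCLUDE = re.compile(r"https://i\d\.wp\.com/")

def keep_src(src):
    """Return True if src does not start with the excluded CDN prefix."""
    # re.match only matches at the start of the string, much like str.startswith().
    return EXCLUDE.match(src) is None

# Hypothetical values standing in for img['src'] from the loop above.
sample_srcs = [
    "https://i2.wp.com/www.example.com/wp-content/uploads/2020/06/image-name.png",
    "https://www.example.com/wp-content/uploads/2020/06/logo.png",
]
kept = [s for s in sample_srcs if keep_src(s)]
print(kept)
```

In the loop, keep_src(img['src']) would take the place of the .startswith() check. A regex is only worth the extra complexity if several prefixes need to be excluded at once; for a single fixed prefix, .startswith() is simpler.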
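You can use attribute=value selectors to combine the various requirements (the img has a src, but not a src that starts with a particular string) into one line of code (no loop).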
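The selector itself did not survive in this copy of the answer, so the following is only a sketch of what such a one-liner could look like. It assumes a BeautifulSoup version whose select() supports :not() and the ^= prefix operator (4.7.0 or later, which uses the soupsieve package); the HTML string is a hypothetical stand-in for the fetched page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the fetched page.
html = """
<img src="https://i2.wp.com/www.example.com/wp-content/uploads/2020/06/image-name.png">
<img src="https://www.example.com/wp-content/uploads/2020/06/logo.png">
<img alt="no src attribute">
"""
soup = BeautifulSoup(html, "html.parser")

# img[src] keeps only tags that have a src attribute;
# :not([src^="..."]) drops those whose src starts with the given prefix.
srcs = [img["src"] for img in soup.select('img[src]:not([src^="https://i2.wp.com"])')]
print(srcs)
```

The filtering happens inside the selector, so no explicit has_attr() or startswith() check is needed.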
It's spelled, e.g., if img['src'].startswith("https://i2.wp.com/"): ...