Python: excluding certain image paths from a BeautifulSoup web scrape

I created the following Python script to extract the image src paths from a given URL:

from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

url="https://www.example.com/"

session = HTMLSession()
r = session.get(url)

b  = requests.get(url)
soup = BeautifulSoup(b.text, "lxml") 

images = soup.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
The script works fine, but we use a CDN, so some image paths look like this:

https://i2.wp.com/www.example.com/wp-content/uploads/2020/06/image-name.png?fit=250%2C250&ssl=1 
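So I would like to be able to exclude image src paths that start with a certain string (possibly using a regex), for example:

Is this possible?

Thanks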

Very simple:

url="https://www.example.com/"
# startswith() compares literally, so the prefix must not contain regex syntax such as ".*"
exclude="https://i2.wp.com"

images = soup.find_all('img')
for img in images:
    if img.has_attr('src') and not img["src"].startswith(exclude):
        print(img['src'])

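You can use a regex, or simply use .startswith() as described. Then, in the for loop, if the src starts with that prefix, continue. This stops processing the current item and moves on to the next one in the iteration: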

url="https://www.example.com/"
exclude="https://i2.wp.com"

images = soup.find_all('img')
for img in images:
    if img.has_attr('src'):
        if img['src'].startswith(exclude):
            continue
        print(img['src'])
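
The regex route mentioned above can be sketched as follows. This is only a minimal illustration using the standard re module: the pattern, the helper name keep_src, and the sample paths are assumptions made up for the example and are not part of the original script.

```python
import re

# Hypothetical pattern: matches CDN hosts such as i0.wp.com, i1.wp.com, i2.wp.com
# at the start of the string, over http or https
EXCLUDE = re.compile(r"https?://i\d+\.wp\.com/")

def keep_src(src):
    """Return True if src does not start with the excluded CDN prefix."""
    # re.match only matches at the beginning of the string, like startswith()
    return EXCLUDE.match(src) is None

# Made-up sample paths, for illustration only
sample = [
    "https://www.example.com/wp-content/uploads/2020/06/image-name.png",
    "https://i2.wp.com/www.example.com/wp-content/uploads/2020/06/image-name.png?fit=250%2C250&ssl=1",
    "https://i0.wp.com/www.example.com/logo.png",
]

kept = [s for s in sample if keep_src(s)]
print(kept)
```

In the loop, the startswith() check would then become a call such as keep_src(img['src']). A regex is mainly worth it when several similar hosts need to be excluded; for a single fixed prefix, startswith() is simpler.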

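You can use attribute=value selectors to combine the requirements (the img has a src, but not a src starting with a particular string) into a single line of code, with no loop: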
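
A minimal sketch of such a selector, assuming BeautifulSoup 4.7 or later (where select() supports :not() and the ^= "starts with" attribute operator). The sample HTML is made up for illustration, and the exact selector the answer had in mind may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only
html = """
<img src="https://www.example.com/wp-content/uploads/logo.png">
<img src="https://i2.wp.com/www.example.com/wp-content/uploads/2020/06/image-name.png">
<img alt="no src attribute">
"""

soup = BeautifulSoup(html, "html.parser")

# img[src] requires a src attribute; :not([src^="..."]) drops the images
# whose src begins with the CDN prefix
srcs = [img["src"] for img in soup.select('img[src]:not([src^="https://i2.wp.com"])')]
print(srcs)
```

The list comprehension is only there to pull out the src values; the filtering itself happens entirely inside the selector string.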


It is spelled, for example,
if img['src'].startswith("https://i2.wp.com/"): ...