Python 2.7: Excluding unwanted base64 links in BeautifulSoup

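I wrote a simple image scraper script that works fine in most cases. I came across a website with some nice JPG wallpapers whose links I want to scrape. The script works, but it also prints unwanted base64 data-image links. How can I exclude these base64 links?

Output: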

https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/ubuntu-feeling.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/two-gentlemen-in-car.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==

Update: thanks for the help. The complete code to download all the images then looks like this. Cheers:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(r.content, 'lxml')

# img[src$=".jpg"] matches only <img> tags whose src ends in ".jpg",
# so the base64 data: URIs are skipped.
for link in soup.select('img[src$=".jpg"]'):
    image = link['src']
    # The file name is the last '/'-separated segment of the URL.
    image_name = image.split('/')[-1]
    print('Downloading: {}'.format(image_name))
    r2 = requests.get(image)
    with open(image_name, 'wb') as f:
        f.write(r2.content)
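
The script above takes the local file name from the last '/'-separated segment of the image URL. As a minimal sketch of a slightly more defensive variant, assuming an image URL might carry a query string (which is not the case in the output shown in the question), the name can be taken from the parsed path instead. The helper name filename_from_url and the example.com URL are hypothetical and only serve the illustration.

```python
import posixpath

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def filename_from_url(url):
    """Return the last path segment of url, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


# URL copied from the output shown in the question.
name = filename_from_url(
    'https://assets.hongkiat.com/uploads/'
    '60-most-execellent-ubuntu-wallpapers/ubuntu-feeling.jpg')
print(name)

# Hypothetical URL with a query string, to show the difference
# from a plain split('/')[-1].
name_with_query = filename_from_url('https://example.com/a/b.jpg?x=1')
print(name_with_query)
```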

Try this. It should give you the result you want. I used .select() here instead of .find_all():

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(r.content, 'lxml')

for link in soup.select('img[src$=".jpg"]'):
    print(link['src'])

Or, if you prefer to do the same thing with .find_all():

for link in soup.find_all('img'):
    # .get() avoids a KeyError on <img> tags that have no src attribute
    if ".jpg" in link.get('src', ''):
        print(link['src'])

That works, thank you. So why use select rather than find_all in this case? And what does "img[src$=.jpg]" mean?
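
On select versus find_all and the meaning of img[src$=".jpg"]: the string is a CSS selector, and [src$=".jpg"] is the standard "attribute value ends with" test, so only img tags whose src ends in ".jpg" are returned. With find_all the same check has to be written by hand. The sketch below shows the two side by side on a small piece of made-up markup; the example.com URLs are hypothetical, and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = '''
<img src="https://example.com/a.jpg">
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">
<img src="https://example.com/b.jpg?size=large">
<img alt="no src attribute">
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: keep <img> tags whose src attribute ends with ".jpg".
by_select = [img['src'] for img in soup.select('img[src$=".jpg"]')]

# find_all equivalent: visit every <img> and test the attribute by hand.
# .get() returns '' for tags that have no src attribute at all.
by_find_all = [img['src'] for img in soup.find_all('img')
               if img.get('src', '').endswith('.jpg')]

print(by_select)
print(by_find_all)
```

Both lists contain only the first URL: the data: URI and the URL with a query string do not end in ".jpg", and the tag without src is skipped. Which form to use is mostly a matter of taste; the selector is shorter, while the loop is easier to extend with extra conditions.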

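import re
from urllib.request import urlopen  # Python 3; on Python 2.7 use: from urllib2 import urlopen

from bs4 import BeautifulSoup

htmldata = urlopen('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')

soup = BeautifulSoup(htmldata, 'html.parser')

# Keep only <img> tags whose src ends in .jpg, .jpeg or .png.
# The pattern is anchored at the end so that a data: URI that merely
# contains "png" or "jpg" somewhere in its text is not matched.
result = soup.find_all('img', src=re.compile(r'\.(?:jpe?g|png)$'))

for img in result:
    print(img['src'])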
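
The answers above all filter by file extension. Another option, sketched here under the assumption that every wanted image has an absolute http or https URL (as in the output shown in the question), is to reject data: URIs by their scheme, whatever the extension. The helper name is_http_url is hypothetical. Note that scheme-relative URLs such as //host/image.jpg would also be rejected by this check.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def is_http_url(src):
    """Return True only for http(s) URLs, so data: URIs are rejected."""
    return urlparse(src).scheme in ('http', 'https')


# Sample values copied from the output shown in the question.
srcs = [
    'https://assets.hongkiat.com/uploads/'
    '60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg',
    'data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==',
]

wanted = [s for s in srcs if is_http_url(s)]
print(wanted)
```

In a scraper, the same test could be applied to each src value before printing or downloading it.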