Python 2.7: Excluding unwanted base64 links with BeautifulSoup
I wrote a simple image scraper script that works fine in most cases. I came across a website with some nice jpg wallpapers whose links I want to scrape. The script works, but it also prints unwanted base64 data-image links. How can I exclude these base64 links?

Output:
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/ubuntu-feeling.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/two-gentlemen-in-car.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
Update.
Thanks for the help. So the complete code to download all the images looks like this. Cheers:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(r.content, 'lxml')

# Select only <img> tags whose src ends with .jpg; this skips the base64 data: URIs
for link in soup.select('img[src$=".jpg"]'):
    image = link['src']
    # The file name is the last path segment of the image URL
    image_name = image.split('/')[-1]
    print('Downloading: {}'.format(image_name))
    r2 = requests.get(image)
    with open(image_name, 'wb') as f:
        f.write(r2.content)
Try this. It should give you the desired result. I used .select() here instead of .find_all():

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(r.content, 'lxml')
# Match only <img> tags whose src attribute ends with .jpg
for link in soup.select('img[src$=".jpg"]'):
    print(link['src'])

Or, if you prefer to do the same with .find_all():

# src=True skips <img> tags that have no src attribute
for link in soup.find_all('img', src=True):
    if ".jpg" in link['src']:
        print(link['src'])

This works, thank you. So why use select rather than find_all in this case? And what does 'img[src$=".jpg"]' mean?
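
As for what the selector means: in CSS, [src$=".jpg"] matches elements whose src attribute value ends with .jpg, which is why the data: URIs drop out. A minimal plain-Python sketch of that same suffix test, alongside an alternative check on the data: scheme itself, might look like the following. The helper names ends_with_jpg and is_data_uri are hypothetical, and the sample values are copied from the output in the question.

```python
def ends_with_jpg(src):
    """Mirror of the CSS [src$=".jpg"] test: the value must end with .jpg."""
    return src.endswith('.jpg')


def is_data_uri(src):
    """Return True for inline data: URIs (such as base64 placeholders)."""
    return src.strip().lower().startswith('data:')


# Sample src values taken from the output shown in the question.
srcs = [
    'https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg',
    'data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==',
    'https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/ubuntu-feeling.jpg',
]

# Approach 1: keep only values ending in .jpg (what the selector does).
jpg_only = [s for s in srcs if ends_with_jpg(s)]

# Approach 2: keep everything except data: URIs, whatever the extension.
not_inline = [s for s in srcs if not is_data_uri(s)]

for url in jpg_only:
    print(url)
```

On this sample both lists contain the same two URLs. The second approach would also keep .png or .gif files, so which one fits depends on whether only jpg wallpapers are wanted.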
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7
import re
from bs4 import BeautifulSoup

htmldata = urlopen('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(htmldata, 'html.parser')
# Keep only <img> tags whose src contains jpeg, png, or jpg
result = soup.find_all('img', src=re.compile(r".*?(?=jpeg|png|jpg)"))
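
Note that the pattern above only requires jpeg, png, or jpg to appear somewhere in the src, not at the end. A stricter variant, sketched here with the standard re module only, anchors the extension to the end of the URL (allowing an optional query string). The names LOOSE and IMAGE_EXT and the example.com URL are made up for illustration; the other samples come from the question's output.

```python
import re

# The pattern from the answer: matches if jpeg, png or jpg occurs anywhere.
LOOSE = re.compile(r".*?(?=jpeg|png|jpg)")

# Hypothetical stricter pattern: the extension must come at the end of the
# URL, optionally followed by a query string.
IMAGE_EXT = re.compile(r"\.(?:jpe?g|png)(?:\?.*)?$", re.IGNORECASE)

samples = [
    'https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/two-gentlemen-in-car.jpg',
    'data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==',
    'https://example.com/jpg-gallery/index.html',  # made-up non-image URL
]

loose_hits = [s for s in samples if LOOSE.search(s)]
strict_hits = [s for s in samples if IMAGE_EXT.search(s)]

for url in strict_hits:
    print(url)
```

Both patterns reject the base64 data: URI here, but the loose one also accepts the made-up gallery page because its path contains "jpg". A compiled pattern like IMAGE_EXT could be passed as src=IMAGE_EXT to find_all in the same way as the original regex.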