Python: How to separate image links in BeautifulSoup4 based on what the link contains
Tags: python, web-scraping, beautifulsoup
I'm new to BeautifulSoup4 and I'm trying to get all the image links from a website, for example Unsplash. I only want the URLs that contain the word "photo"; I don't want URLs that contain the word "profile". I'm using Python 3.6 and urllib3.
You can use this script as an example of how to filter the links:
import requests
from bs4 import BeautifulSoup

url = 'https://unsplash.com'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for img in soup.find_all('img'):
    src = img.get('src', '')  # some <img> tags have no src attribute
    if 'photo' in src:  # print only links with `photo` inside them
        print(src)
Prints:
https://images.unsplash.com/photo-1597649260558-e2bd7d35f043?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format%2Ccompress&fit=crop&w=1000&h=1000
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
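The output above repeats some links, and the question also asked to leave out "profile" URLs. A minimal sketch of that extra filtering step, assuming the links have already been collected into a list; the helper name `filter_links` and the `images.example.com` URLs are made up for illustration:

```python
def filter_links(urls, include="photo", exclude="profile"):
    """Keep URLs that contain `include`, skip ones that contain `exclude`,
    and drop repeats while preserving the original order."""
    seen = set()
    kept = []
    for url in urls:
        if include not in url or exclude in url:
            continue  # wrong kind of link
        if url in seen:
            continue  # already reported once
        seen.add(url)
        kept.append(url)
    return kept


# Hypothetical sample data standing in for the scraped `src` values.
links = [
    "https://images.example.com/photo-1?w=1000",
    "https://images.example.com/profile-2?w=32",
    "https://images.example.com/photo-1?w=1000",
    "https://images.example.com/photo-3?w=1000",
]

for link in filter_links(links):
    print(link)
```

Note that this is a plain substring check on the whole URL, so a link whose query string happens to contain "photo" would also match.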
Using urllib:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://unsplash.com'
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')

for img in soup.find_all('img'):
    src = img.get('src', '')  # some <img> tags have no src attribute
    if 'photo' in src:  # print only links with `photo` inside them
        print(src)
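As an alternative to checking each tag in a loop, BeautifulSoup's `select()` accepts a CSS attribute selector, so the "contains photo" test can go into the selector itself. This is only a sketch: the function name `photo_links` and the sample HTML are made up, and it parses an inline string instead of fetching a live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<img src="https://images.example.com/photo-1?w=1000">
<img src="https://images.example.com/profile-2?w=32">
<img alt="no src attribute">
<img src="https://images.example.com/photo-3?w=1000">
"""


def photo_links(html):
    """Return src values that contain 'photo' but not 'profile'."""
    soup = BeautifulSoup(html, "html.parser")
    # img[src*="photo"] matches only <img> tags whose src contains "photo",
    # so tags without a src attribute are skipped automatically.
    return [
        img["src"]
        for img in soup.select('img[src*="photo"]')
        if "profile" not in img["src"]
    ]


print(photo_links(SAMPLE_HTML))
```

To run it against a real page, pass in the downloaded HTML (for example `requests.get(url).content`) instead of `SAMPLE_HTML`.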
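You can simply fetch all of them and then filter out the ones you don't need in your code.
Can you add an example with code? I'm not familiar with it.
Thanks a lot, Andrej Kesely.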