Python Beautiful Soup: how do I scrape images whose src attribute starts with a specific value?
I started learning web scraping a few days ago and thought it would be fun to try scraping Mangadex as a small project. Thanks in advance for any advice.
I am trying to scrape images by extracting the src attribute of img tags, using Beautiful Soup 4 and Python 3.7.
The part of the HTML I am interested in is:
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
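Every image I am interested in has a src attribute that starts with the same prefix, so I thought I could target the images whose src starts with that prefix.
I tried using select() to find the img elements and then get() to read the src, but had no luck with this particular part of the HTML.
The parts of the HTML where select() and get() did work include: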
attrs lists every attribute set on the tag. It is a dictionary, so you can look up a specific attribute value by key, as shown below.
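# for getting webpages
import requests
# for Beautiful Soup
from bs4 import BeautifulSoup

# Placeholder: replace with the URL of the page you want to scrape.
URL_LINK = 'https://example.com/'
base_url = 'https://s5.mangadex.org/data/'

r = requests.get(URL_LINK)
# Name the parser explicitly to avoid the "no parser specified" warning.
bs = BeautifulSoup(r.content, 'html.parser')

imgs = bs.find_all('img')
for img in imgs:
    # attrs is a dict; .get() returns None when the tag has no src
    src = img.attrs.get('src')
    if src is None:
        continue
    # prepend the base URL when src does not already start with it
    if not src.startswith(base_url):
        src = base_url + src
    print(src)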
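As a small illustration of the attrs dictionary described above, the sketch below parses two img tags from an inline string rather than from a live page. The second tag, which has no src, is a made-up example added only to show the difference between indexing and get().

```python
from bs4 import BeautifulSoup

# Inline sample: the first tag is from the question, the second is hypothetical.
html = (
    '<img draggable="false" class="noselect nodrag cursor-pointer" '
    'src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">'
    '<img alt="no source">'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("img")

# attrs is a plain dict of every attribute set on the tag
attr_names = sorted(first.attrs)

# tag["name"] is shorthand for tag.attrs["name"]
src = first["src"]

# class is a multi-valued attribute, so Beautiful Soup returns a list
classes = first["class"]

# get() returns None for a missing attribute; second["src"] would raise KeyError
missing = second.get("src")

print(attr_names, src, classes, missing)
```

Using get() is the safer choice when looping over every img on a page, since not all of them are guaranteed to carry a src.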
Try this:
from bs4 import BeautifulSoup
html = """
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
for n in soup.find_all('img'):
    # default to '' so an img without a src does not raise AttributeError
    if n.get('src', '').startswith('https://s5.mangadex.org/data/'):
        print(n.get('src'))
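The same prefix filter can also be pushed into the query itself. The sketch below shows two ways, assuming a Beautiful Soup version recent enough to bundle the soupsieve selector engine (4.7 or later): a CSS attribute selector with ^= ("starts with"), and find_all() with a compiled regular expression. The trimmed HTML and the variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Trimmed-down sample; the third img (no src) is a hypothetical edge case.
html = """
<div><img src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg"></div>
<div><img src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg"></div>
<div><img alt="no src here"></div>
"""
prefix = "https://s5.mangadex.org/data/"
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: [src^="..."] matches values that start with the prefix
by_css = [img["src"] for img in soup.select(f'img[src^="{prefix}"]')]

# find_all() accepts a compiled regex as an attribute filter
pattern = re.compile("^" + re.escape(prefix))
by_regex = [img["src"] for img in soup.find_all("img", src=pattern)]

print(by_css)
print(by_regex)
```

Both queries skip the img without a src automatically, so no None check is needed in the loop.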
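You can't scrape Mangadex directly with BeautifulSoup. Mangadex loads its images with JavaScript after the document is ready, so what BeautifulSoup parses is the still-empty document. That is why it fails. This site explains how to scrape web pages that rely on JavaScript to serve their content: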
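To make the point about JavaScript concrete, here is a minimal sketch using a made-up page, not Mangadex's real markup. The server-side HTML contains only an empty wrapper plus a script that would add the image in a browser. Beautiful Soup parses the text but never executes the script, so no img element exists in the parsed tree.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the wrapper is empty and a script is meant to
# insert the image once the page runs in a browser.
html = """
<div class="reader-image-wrapper" data-page="1"></div>
<script>
  var img = document.createElement('img');
  img.src = 'https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg';
  document.querySelector('.reader-image-wrapper').appendChild(img);
</script>
"""

soup = BeautifulSoup(html, "html.parser")

# The script body is kept as plain text and is never run
imgs = soup.find_all("img")
wrapper = soup.find("div", class_="reader-image-wrapper")

print(len(imgs))            # 0
print(wrapper.find("img"))  # None
```

Getting the images in such a case generally means either rendering the page with a tool that runs JavaScript or finding the request the script itself makes for the image data.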
Do you have any code? FYI, the word is scrape (scraping, scraped, scraper), not scrap.
The "Try this" snippet prints the following for the sample HTML:
https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg