Python Beautiful Soup: How do I scrape images that contain a specific src attribute?


python html web-scraping


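I started learning web scraping a few days ago and thought it would be fun to try scraping Mangadex as a small project. Thanks in advance for any advice.

I'm trying to scrape images by extracting the src attribute of img tags, using Beautiful Soup 4 and Python 3.7.

The part of the HTML I'm interested in is: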

<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
  <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>

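Every image I'm interested in has a src attribute that starts with "", so I thought I could target the images whose src begins with that value.

I tried using select() to find the img elements and then get() to read src, but had no luck with this particular part of the HTML.

Parts of the HTML where select() and get() do work include: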
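
One way to express "src starts with this prefix" directly in select() is the CSS attribute-prefix selector `[src^="..."]`. The following is a minimal sketch, not a confirmed fix: it assumes the prefix is the `https://s5.mangadex.org/data/` seen in the HTML sample above (the prefix in the question text itself is missing), and it assumes the markup is already available as a string. The second `img` is a made-up non-matching example.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the snippet in the question; the second image is
# a hypothetical example that should NOT match.
html = """
<div class="reader-image-wrapper">
  <img class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
<img src="/images/logo.png">
"""

# Assumed prefix, taken from the sample src above.
PREFIX = "https://s5.mangadex.org/data/"

soup = BeautifulSoup(html, "html.parser")

# [src^="..."] matches elements whose src attribute starts with the given string.
matches = soup.select(f'img[src^="{PREFIX}"]')
sources = [img.get("src") for img in matches]
print(sources)
```

This only helps when the `img` tags are actually present in the HTML that Beautiful Soup receives.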

attrs lists every attribute set on the tag. It is a dictionary, so to get the value of a specific attribute, see below.
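
To make the dictionary behavior concrete, here is a small sketch of the three common ways to read an attribute. The markup is a shortened version of the tag from the question plus a hypothetical second `img` with no src, added only to show what happens when the attribute is missing.

```python
from bs4 import BeautifulSoup

html = (
    '<img draggable="false" class="noselect nodrag" '
    'src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">'
    '<img alt="no source">'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("img")

# attrs is a plain dict mapping attribute names to values.
# Multi-valued attributes such as class come back as a list.
print(first.attrs)

# Indexing the tag is shorthand for indexing attrs; it raises KeyError
# when the attribute is absent.
src = first["src"]

# get() returns None (or a supplied default) instead of raising.
missing = second.get("src")
print(src, missing)
```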

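# for getting webpages
import requests
# for beautiful soup
from bs4 import BeautifulSoup

# URL_LINK is a placeholder for the page you want to scrape
r = requests.get(URL_LINK)

base_url = 'https://s5.mangadex.org/data/'

# name the parser explicitly to avoid the "no parser specified" warning
bs = BeautifulSoup(r.content, 'html.parser')
imgs = bs.find_all('img')
for img in imgs:
    # attrs is a dict; get() avoids a KeyError when an img has no src
    src = img.attrs.get('src')
    if src is None:
        continue
    # prepend the base URL when the src does not already start with it
    if not src.startswith(base_url):
        src = base_url + src
    print(src)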
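
The snippet above prepends `base_url` to any src that does not already start with it, which would also rewrite absolute URLs from other hosts. If the goal is to turn relative paths into full URLs, `urllib.parse.urljoin` from the standard library is one alternative. A minimal sketch follows; the page URL and the two relative paths are hypothetical and only illustrate how resolution behaves.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only as the base for resolving relative paths.
page_url = "https://example.com/chapter/1/"

srcs = [
    # already absolute: returned unchanged
    "https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg",
    # root-relative: resolved against the host
    "/images/logo.png",
    # page-relative: resolved against the page's directory
    "thumbs/x1.jpg",
]

resolved = [urljoin(page_url, s) for s in srcs]
for url in resolved:
    print(url)
```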

Try this:

from bs4 import BeautifulSoup

html = """
      <div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
      <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
      </div>
      <div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
      <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
      </div>
       """
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

for n in soup.find_all('img'):
    # default to '' so an img without a src does not raise AttributeError
    src = n.get('src', '')
    # keep only images served from the s5 host
    if src.startswith('https://s5.mangadex.org/data/'):
        print(src)
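
The same filtering can also be pushed into `find_all()` itself by passing a compiled regular expression as the `src` filter. The sketch below assumes the images may be served from several numbered hosts (`s4`, `s5`, and so on), which is a guess based on the two URLs in the sample HTML; the last two `img` tags are hypothetical non-matching examples.

```python
import re
from bs4 import BeautifulSoup

html = """
<img src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
<img src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
<img src="https://example.com/banner.png">
<img alt="no src at all">
"""
soup = BeautifulSoup(html, "html.parser")

# Assumed pattern: any "s<digits>" subdomain followed by the /data/ path.
pattern = re.compile(r"^https://s\d+\.mangadex\.org/data/")

# An attribute filter given as a regex keeps only tags whose src matches it;
# tags without a src attribute are skipped.
imgs = soup.find_all("img", src=pattern)
found = [img["src"] for img in imgs]
print(found)
```

Anchoring the pattern with `^` keeps the "starts with" meaning of the original `startswith()` check.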

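You can't scrape Mangadex directly with BeautifulSoup. Mangadex loads its images with JavaScript after the document is ready, so what BeautifulSoup receives is the empty document from before the images were inserted. That is why it fails. This site explains how to scrape web pages that rely on JavaScript to serve their content.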
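
A quick way to check whether this applies to a given page is to count the matching `img` tags in the HTML exactly as it was fetched. The sketch below uses a hypothetical stand-in for such a response (a container and a script, but no images); `count_images` is an illustrative helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what a plain HTTP fetch might return for a
# JavaScript-rendered reader page: a container and a script, but no <img>.
static_html = """
<html><body>
  <div id="reader"></div>
  <script src="/scripts/reader.js"></script>
</body></html>
"""


def count_images(html, prefix=""):
    """Count <img> tags in already-fetched HTML whose src starts with prefix."""
    soup = BeautifulSoup(html, "html.parser")
    return sum(
        1 for img in soup.find_all("img")
        if img.get("src", "").startswith(prefix)
    )


n_images = count_images(static_html)
n_scripts = len(BeautifulSoup(static_html, "html.parser").find_all("script"))
print(n_images, n_scripts)
```

If the count is zero for HTML fetched with requests while the images are visible in a browser, the content is probably being injected by JavaScript, and no selector applied to the static HTML will find it.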

Do you have any code? FYI, the word is "scrape" (scraping, scraped, scraper), not "scrap".
Running the "Try this" snippet above prints:

https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg