Python Beautiful Soup: how do I scrape images that contain a specific src attribute?


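I started learning web scraping a few days ago, and I thought it would be fun to try scraping Mangadex as a small project. Thanks in advance for any advice!

I'm trying to scrape images by extracting the src attribute of img tags, using Beautiful Soup 4 and Python 3.7.

The part of the HTML I'm interested in is: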

<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
  <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>

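Every image I'm interested in has a src attribute that starts with "", so I thought I could target the images whose src begins with that particular string.

I tried using select() to find the img elements and then get() to read the src, but had no luck with this particular part of the HTML.

The parts of the HTML where select() and get() did work include: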


attrs lists all the attributes set on the tag. It is a dictionary, so to get a specific attribute value, see below.

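# for getting webpages
import requests
# for Beautiful Soup
from bs4 import BeautifulSoup

base_url = 'https://s5.mangadex.org/data/'

# URL_LINK is a placeholder for the address of the page to scrape
r = requests.get(URL_LINK)
# name the parser explicitly so bs4 does not have to guess one
bs = BeautifulSoup(r.content, 'html.parser')
for img in bs.find_all('img'):
    # attrs is a dict; .get() returns None when the tag has no src
    src = img.attrs.get('src')
    if src is None:
        continue
    # treat any src that lacks the prefix as a path relative to base_url
    if not src.startswith(base_url):
        src = base_url + src
    print(src)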
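
As a small illustration of the point that attrs is a dictionary, here is a sketch on static markup. The second img tag without a src is made up for the example; it is there only to show what happens when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Static sample markup; the second <img> deliberately has no src (made up for this example).
html = """
<img class="noselect nodrag" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
<img class="placeholder">
"""

soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("img")

# attrs is a plain dict of the attributes set on the tag.
# Note that class is multi-valued, so its value is a list of strings.
print(first.attrs)

# Tag.get() looks the name up in attrs, so a missing attribute gives None
# instead of raising KeyError the way second.attrs["src"] would.
print(first.get("src"))
print(second.get("src"))
```

Using get() rather than indexing keeps a loop over many img tags from stopping at the first tag that has no src.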

Try this:

from bs4 import BeautifulSoup

html = """
      <div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
      <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
      </div>
      <div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
      <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
      </div>
       """
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

for n in soup.find_all('img'):
    src = n.get('src')
    # skip tags without a src, and the s4 image that lacks the prefix
    if src and src.startswith('https://s5.mangadex.org/data/'):
        # prints https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg
        print(src)
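
Since the question mentions select(), the same filter can probably also be written as a CSS attribute selector. This is a sketch, assuming Beautiful Soup 4.7 or later (where select() is backed by the soupsieve package). The helper name images_with_prefix and the shortened markup are made up for the example.

```python
from bs4 import BeautifulSoup

PREFIX = "https://s5.mangadex.org/data/"

# Shortened version of the markup from the question, plus one <img> without a src.
html = """
<div class="reader-image-wrapper"><img src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg"></div>
<div class="reader-image-wrapper"><img src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg"></div>
<div class="reader-image-wrapper"><img alt="no src here"></div>
"""


def images_with_prefix(markup, prefix):
    """Return the src of every <img> whose src starts with prefix (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # [src^="..."] matches only elements that have a src beginning with the given string,
    # so tags without a src are skipped by the selector itself.
    selector = f'img[src^="{prefix}"]'
    return [img["src"] for img in soup.select(selector)]


matches = images_with_prefix(html, PREFIX)
print(matches)
```

The selector does the prefix check and the missing-attribute check in one step, so no startswith() test is needed in the loop.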

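You can't scrape Mangadex directly with BeautifulSoup. Mangadex loads its images with JavaScript after the document is ready, so what BeautifulSoup gets to parse is the initial document, which contains no images yet. That is why it fails. This site explains how to scrape web pages that rely on JavaScript to provide their content: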
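
To see that failure mode on static input, here is a sketch that parses two hand-written documents. Both strings are assumptions made for illustration: initial_html stands in for what a plain HTTP fetch might return before any script has run, and rendered_html for the same page after the browser has inserted the image. Neither is actual Mangadex output.

```python
from bs4 import BeautifulSoup

# Hypothetical page source before JavaScript runs: the wrapper exists, the <img> does not.
initial_html = """
<div class="reader-image-wrapper" data-page="1"></div>
<script>/* images are inserted here after the document is ready */</script>
"""

# Hypothetical DOM after a browser has executed the script.
rendered_html = """
<div class="reader-image-wrapper" data-page="1">
  <img src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
"""


def image_sources(markup):
    """Collect the src of every <img> that has one (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # src=True restricts the search to tags that actually carry a src attribute.
    return [img["src"] for img in soup.find_all("img", src=True)]


print(image_sources(initial_html))   # the parser finds no <img> tags at all
print(image_sources(rendered_html))  # the image is present once the DOM is rendered
```

If the first call comes back empty on a real page, the parsing code is usually fine and the fetched HTML simply does not contain the images yet.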


Do you have any code? FYI, the word is "scrape" (scraping, scraped, scraper), not "scrap".