Python can't download all images with BeautifulSoup


I'm not very familiar with web scraping, and I can't get BeautifulSoup to download images.

I need to download all the images from a website. I'm using the following code:

import re
import requests
from bs4 import BeautifulSoup

site = 'http://someurl.org/'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')

# img_tags = soup.findAll('img')
img_tags = soup.findAll('img',{"src":True})

print('img_tags: ')
print(img_tags)

urls = [img['src'] for img in img_tags]

print('urls: ')
print(urls)

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative 
            # if it is provide the base url which also happens 
            # to be the site variable atm. 
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
However, this ignores all the images displayed on the page whose HTML looks like this:

<img data-bind="attr: { src: thumbURL() }" src="/assets/images/submissions/abfc-2345345234.thumb.png">


I assume this is because the data attribute also contains the string "src", but I can't seem to figure it out.
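As a sanity check (a minimal sketch parsing just that one snippet in isolation), the `{"src": True}` filter does appear to match such a tag, which suggests the tag is simply not present in the HTML that requests receives:

```python
from bs4 import BeautifulSoup

# the lone <img> tag from the page, pasted as a static string
html = ('<img data-bind="attr: { src: thumbURL() }" '
        'src="/assets/images/submissions/abfc-2345345234.thumb.png">')

soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('img', {"src": True})
print(len(tags))       # the filter does match the tag
print(tags[0]['src'])  # '/assets/images/submissions/abfc-2345345234.thumb.png'
```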

You need to use Selenium or some tool that can run JavaScript. Here is code that finds the images after the page has loaded them:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time

site = 'http://phylopic.org/'
dr = webdriver.Chrome()

dr.get(site)
try:
    # wait until at least one thumbnail element is visible
    element = WebDriverWait(dr, 20, 0.5).until(
        EC.visibility_of_element_located((By.CLASS_NAME, "span1"))
    )
except TimeoutException:
    print("Wait a bit more")
    time.sleep(5)

text = dr.page_source
soup = BeautifulSoup(text,"lxml")
imgs = soup.find_all('img')
print(imgs)

dr.close()
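Once Selenium has rendered the page, the downloading itself can reuse the requests approach from the question. A minimal sketch (the helper names are my own) that resolves each src against the page URL and derives a file name with the standard library:

```python
import os
import posixpath
from urllib.parse import urljoin, urlparse

def absolute_src(page_url, src):
    # urljoin handles relative, root-relative, and
    # protocol-relative '//host/...' sources in one call
    return urljoin(page_url, src)

def filename_for(url):
    # last path segment of the URL, e.g. 'abfc-2345345234.thumb.png'
    return posixpath.basename(urlparse(url).path)

def download_all(page_url, img_tags, folder="."):
    import requests  # third-party, as in the question's code
    for tag in img_tags:
        url = absolute_src(page_url, tag["src"])
        name = filename_for(url)
        if not name:
            continue  # src pointed at a directory; nothing to save
        with open(os.path.join(folder, name), "wb") as f:
            f.write(requests.get(url).content)
```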
The second problem is how to convert a relative path into an absolute path. There are several kinds of relative path in HTML. When the page URL is
http://someurl.org/somefd/somefd2
an image src can resolve to, for example:
  • http://someurl.org/somefd/somefd2/picture.jpg
  • http://someurl.org/somefd/somefd2/images/picture.jpg
  • http://someurl.org/images/picture.jpg
  • http://someurl.org/somefd/picture.jpg
Here is my code to convert a relative path (rp) into an absolute path (ap):

import re

site = 'https://en.wikipedia.org/wiki/IMAGE'


def r2a(path,site=site):
    rp = re.findall(r"(/?\W{2}\/)+?",path)

    if path.find("http") == 0: 
        #full http url
        return path

    elif path.find("//") == 0: 
        #http url lack of http:
        return "http:" + path

    elif path.find("//") < 0 and path.find("/") == 0: 
        # located in the folder at the root of the current web
        site_root = re.findall("http.{3,4}[^/]+",site)
        return site_root[0] + path

    elif rp: 
        # located in the folder one level up from the current folder
        sitep = len(re.findall(r"([^/]+)+",site)) - 2 - len(rp)
        # TODO: raise an error when sitep goes negative (path climbs above the root)
        new_path = re.findall("(http.{4}[^/]+)(/[^/]+){%d}"%(sitep),site)
        return "{}/{}".format("".join(new_path[0]),path.replace( "".join(rp) , ""))

    else:
        # located in the same folder as the current page,
        # or in a subfolder of it
        return "{}/{}".format(site,path)


assert "https://en.wikipedia.org/wiki/IMAGE/a.jpg" == r2a("a.jpg")
assert "https://en.wikipedia.org/wiki/IMAGE/unknow/a.jpg" == r2a("unknow/a.jpg")
assert "https://en.wikipedia.org/unknow/a.jpg" == r2a("/unknow/a.jpg")
assert "https://en.wikipedia.org/wiki/a.jpg" == r2a("../a.jpg")
assert "https://en.wikipedia.org/a.jpg" == r2a("../../a.jpg")
assert "https://en.wikipedia.org/wiki/IMAGE/a.jpg" == r2a("https://en.wikipedia.org/wiki/IMAGE/a.jpg")
assert "http://en.wikipedia.org/" == r2a("//en.wikipedia.org/")
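For reference, the standard library's urllib.parse.urljoin covers the same cases (this sketch is not part of the original answer). Note that urljoin treats a base URL without a trailing slash as a file, so to mirror r2a's behaviour the page URL needs a trailing '/':

```python
from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/IMAGE/'  # trailing slash: treat IMAGE as a folder

assert urljoin(base, 'a.jpg') == 'https://en.wikipedia.org/wiki/IMAGE/a.jpg'
assert urljoin(base, 'unknow/a.jpg') == 'https://en.wikipedia.org/wiki/IMAGE/unknow/a.jpg'
assert urljoin(base, '/unknow/a.jpg') == 'https://en.wikipedia.org/unknow/a.jpg'
assert urljoin(base, '../a.jpg') == 'https://en.wikipedia.org/wiki/a.jpg'
assert urljoin(base, '../../a.jpg') == 'https://en.wikipedia.org/a.jpg'
assert urljoin(base, 'https://en.wikipedia.org/wiki/IMAGE/a.jpg') == \
    'https://en.wikipedia.org/wiki/IMAGE/a.jpg'
# protocol-relative URLs inherit the base's scheme (https here, not http)
assert urljoin(base, '//en.wikipedia.org/') == 'https://en.wikipedia.org/'
```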

  • If this is the actual HTML code, it contains unbalanced quotes. So you may want to retrieve all img tags that have a data-bind value, and then extract the correct value yourself.
  • I just edited it; I had accidentally removed a quote, but this is what I see when inspecting the page.
  • How far does it get? Does it print img_tags, or urls?
  • It prints both img_tags and urls, but it completely ignores the images I want and only prints the page's social-media icon images. It looks to me like the problem is img_tags = soup.findAll('img', {"src": True}), because I can't get those image tags and their relative or absolute paths from it.
  • Can you provide a sample URL? I'm trying to get the icon images from here; they have this HTML.
  • The reason you can't get them is that this img needs JavaScript to load its data.
  • @USER2300867 I'm looking at a question I answered that may help me write a better answer.