How do I get the full URL of an image with Python?




I'm trying to write a crawler that visits a web page and downloads every image available on it. My code looks like this:

import random
import urllib.request
import requests
from bs4 import BeautifulSoup

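def get_images(url):
    # Fetch the page and parse its HTML
    code = requests.get(url)
    text = code.text
    soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly
    for img in soup.find_all('img'):
        # src may be a relative path rather than a full URL
        src = img.get('src')
        download_image(src)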


def download_image(url):
    name = random.randrange(1, 100)
    image_name = str(name) + ".jpg"
    urllib.request.urlretrieve(url, image_name)

get_images("http://www.any_url.com/")

The src attribute of many images does not contain a full URL. My question is: how can I get the full URL of each image so that I can download it?

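The full URL of an image is your page's host name (with its scheme) plus the path in the src attribute, provided that src is a root-relative path, i.e. one that starts with /.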

For example, if your page's URL is

http://example.com/foo/bar.html

and the image's src attribute is

/image/smiley.png

then the image's absolute URL will be

http://example.com/image/smiley.png
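
The rule above can be sketched by hand as follows. This is only an illustration: the helper name absolute_from_root_relative is hypothetical, the src value /image/smiley.png is assumed from the example, and the approach only works when src starts with /.

```python
from urllib.parse import urlsplit


def absolute_from_root_relative(page_url, src):
    """Prefix a root-relative src (one starting with "/") with the page's scheme and host.

    Hypothetical helper for illustration; it does not handle other kinds of src.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}{src}"


page_url = "http://example.com/foo/bar.html"
src = "/image/smiley.png"  # assumed value of the image's src attribute

full_url = absolute_from_root_relative(page_url, src)
print(full_url)  # http://example.com/image/smiley.png
```

A src such as thumbs/a.png or ../a.png has to be resolved against the page's directory instead, so plain concatenation is not enough in general.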

You can do this easily with the standard-library function urljoin() from urllib.parse:

from urllib.parse import urljoin
webpage_url = 'http://example.com/foo/bar.html'
src = '/folder/big/a.jpg'
print(urljoin(webpage_url, src))  # http://example.com/folder/big/a.jpg
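
To connect this back to the crawler in the question, here is a minimal sketch of how the src values could be resolved before downloading. The helper name resolve_image_urls and the sample src values are assumptions made for illustration; only urljoin itself comes from the answer.

```python
from urllib.parse import urljoin


def resolve_image_urls(page_url, srcs):
    """Return absolute URLs for the given src values, resolved against page_url."""
    resolved = []
    for src in srcs:
        # img.get('src') returns None when the attribute is missing; skip those,
        # and skip inline data: URIs, which have nothing to download.
        if not src or src.startswith("data:"):
            continue
        resolved.append(urljoin(page_url, src))
    return resolved


page_url = "http://example.com/foo/bar.html"
srcs = [
    "/folder/big/a.jpg",            # root-relative path
    "thumbs/b.png",                 # relative to the page's directory
    "../c.gif",                     # parent directory
    "//cdn.example.org/d.jpg",      # protocol-relative URL
    "https://other.example/e.jpg",  # already absolute, returned unchanged
    None,                           # an <img> tag without a src attribute
]

urls = resolve_image_urls(page_url, srcs)
for u in urls:
    print(u)
# http://example.com/folder/big/a.jpg
# http://example.com/foo/thumbs/b.png
# http://example.com/c.gif
# http://cdn.example.org/d.jpg
# https://other.example/e.jpg
```

In get_images, the same idea would amount to calling download_image(urljoin(url, src)) for each non-empty src, since the page URL passed to get_images serves as the base.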