Python: A simple web crawler using BeautifulSoup4




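I've been following thenewboston's Python 3.4 tutorial, which uses PyCharm, and I'm currently learning how to create a web crawler. I simply want to download all of the XKCD comics, and using the archive seemed very simple. Whenever I run my code, nothing happens. It runs through once and says "Process finished with exit code 0". Where did I screw up?
Thenewboston's tutorial is a little dated, and the website it crawls has since changed domains. I will comment on the part of the video that seems important.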

My code:

import requests
from urllib import request
from bs4 import BeautifulSoup

def download_img(image_url, page):
    name = str(page) + ".jpg"
    request.urlretrieve(image_url, name)


def xkcd_spirder(max_pages):
    page = 1
    while page <= max_pages:
        url = r'http://xkcd.com/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('div', {'img': 'src'}):
            href = link.get('href')
            print(href)
            download_img(href, page)
        page += 1

xkcd_spirder(5)

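The comic is inside the div with the id comic, so you just need to pull the src from the img inside that div, join it to the base URL, and finally request the content and write it to a file. I use the basename of the image URL as the name of the saved file.
I also replaced your while loop with a range loop, and used only requests for all of the HTTP requests: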
import requests
from bs4 import BeautifulSoup
from os import path
from urllib.parse import urljoin # python2 -> from urlparse import urljoin 


def download_img(image_url, base):
    # Save under the file name taken from the URL, e.g.
    # http://imgs.xkcd.com/comics/tree_cropped_(1).jpg -> tree_cropped_(1).jpg
    with open(path.basename(image_url), "wb") as f:
        # image_url is a relative path, so join it to the base before requesting it
        f.write(requests.get(urljoin(base, image_url)).content)


def xkcd_spirder(max_pages):
    base = "http://xkcd.com/"
    for page in range(1, max_pages + 1):
        url = base + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # we only want one image
        img = soup.select_one("#comic img") # or .find('div',id= 'comic').img
        download_img(img["src"], base)

xkcd_spirder(5)

After running the code, you will have the first five comics saved in the working directory.
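As an offline illustration of the steps above, here is a minimal sketch that runs both selectors against a small hand-written HTML string instead of the live site. The markup in SAMPLE_HTML is an assumption made for this example and may not match the real page. It suggests why the code in the question printed nothing: the attribute filter asks for div tags that have an attribute named img with the value src, which matches no element, so the loop body never runs.

```python
from os import path
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for an xkcd comic page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<div id="comic">
<img src="//imgs.xkcd.com/comics/barrel_cropped_(1).jpg" alt="Barrel - Part 1"/>
</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The question's selector: <div> tags carrying an attribute img="src".
# No such div exists here, so the result is empty.
no_matches = soup.find_all("div", {"img": "src"})

# The answer's selector: the <img> inside the element with id="comic".
img = soup.select_one("#comic img")

# The src is not a full URL, so join it to the base, then derive a file name.
full_url = urljoin("http://xkcd.com/", img["src"])
file_name = path.basename(full_url)

print(len(no_matches))
print(full_url)
print(file_name)
```

Because nothing is downloaded, this only checks the selection and URL handling; the actual download still needs the requests call shown in the answer.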
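Video tutorial.
You haven't explained what goes wrong, or asked a question...? My guess, though, is that you get an error in download_img saying that the name 'page' is not defined, because page only exists in xkcd_spirder and you can't use it from anywhere else. You need to pass it into download_img as a parameter.
I just tried what I think you meant. I'm still very new to this. This is the new code, and it has the same problem as before. I also edited the original post so that it actually contains my question. I got a little excited, haha.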
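The scoping point raised in the comments can be sketched without any network access. This assumes a version of download_img that reads page without receiving it as an argument. The names download_img_broken, download_img_fixed and crawl are hypothetical and exist only for this illustration, and the functions return the file name instead of downloading anything.

```python
def download_img_broken(image_url):
    # 'page' is neither a parameter nor a local here, and no global 'page'
    # exists, so looking it up raises NameError when the function runs.
    return str(page) + ".jpg"


def download_img_fixed(image_url, page):
    # 'page' now arrives as an argument, so the name is defined.
    return str(page) + ".jpg"


def crawl(max_pages):
    # 'page' is local to this function; it must be passed on explicitly.
    names = []
    for page in range(1, max_pages + 1):
        names.append(download_img_fixed("http://example.invalid/img", page))
    return names


try:
    download_img_broken("http://example.invalid/img")
    error = None
except NameError as exc:
    error = exc

print(type(error).__name__)
print(crawl(3))
```

Passing the value as a parameter keeps each function independent of the other's local variables, which is the change the comment recommends.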