Python Scrapy: recursively scrape a website


I want to write a scraper that visits all subpages of an initial page.

An example website is: pydro.com. For instance, it should also fetch pydro.com/impressum and save it as an HTML file on my hard drive.

The code I wrote is:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import CsvItemExporter
from scrapy.loader import ItemLoader
from finalproject.items import FinalprojectItem


class ExampleSpider(CrawlSpider):
    name = "projects"  # Spider name
    allowed_domains = ["pydro.com"]  # Which (sub-)domains shall be scraped?
    start_urls = ["https://pydro.com/"]  # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]  # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        self.logger.info('Hi this is an item page! %s', response.url)
        page = response.url.split('.com/')[-1]
        filename = 'pydro.html'  # Fixed filename: every response overwrites the same file
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
When I run the spider, the only output is pydro.html.

I think I need to adjust the filename to get the subpages. Or do I need a for loop?
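
For reference, a minimal sketch of a per-page filename (no for loop is needed: the CrawlSpider calls parse_item once for every crawled page, so it is enough to derive the name from response.url; the slug handling here is an assumption for illustration):

def parse_item(self, response):
    # Take the part of the URL after the domain as the filename slug.
    # Caveat: if that part still contains '/', open() will fail, because
    # it treats '/' as a directory separator.
    page = response.url.split('.com/')[-1].rstrip('/') or 'index'
    filename = 'pydro-%s.html' % page
    with open(filename, 'wb') as f:
        f.write(response.body)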

Edit 1: I edited the code to fetch all the HTML pages. However, when I run the script on another website, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'otego-https://www.otego.de/de/jobs.php'
This is the script I ran:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import CsvItemExporter
from scrapy.loader import ItemLoader

class ExampleSpider(CrawlSpider):
    name = "otego" #Spider name
    allowed_domains = ["otego.de"] # Which (sub-)domains shall be scraped?
    start_urls = ["https://www.otego.de/en/index.php"] # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)] # Follow any link scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        self.logger.info('Hi this is an item page! %s', response.url)
        page = response.url
        filename = 'otego-%s' % page  # page is the full URL, including '/' characters
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
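
The FileNotFoundError happens because the filename is built from the full URL, which contains '/' characters; open() treats those as directory separators and tries to descend into directories such as 'otego-https:' that do not exist. A minimal sketch of one way around this, assuming a slug scheme built with urllib.parse (the naming is illustrative, not the only option):

from urllib.parse import urlparse

def parse_item(self, response):
    # Keep only the URL path and replace its separators so the result
    # is a flat filename instead of a directory tree.
    slug = urlparse(response.url).path.strip('/').replace('/', '-') or 'index'
    filename = 'otego-%s.html' % slug
    with open(filename, 'wb') as f:
        f.write(response.body)

With this scheme, https://www.otego.de/de/jobs.php would be saved as otego-de-jobs.php.html.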

You need to create a recursive scraper. A "subpage" is just another page whose URL is obtained from the "previous" page. You have to issue a second request for the subpage (its URL would be in a variable such as sel) and use XPath on the second response.
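
For completeness, a rough sketch of that manual approach (class and names are illustrative; response.follow resolves relative URLs and Scrapy's scheduler deduplicates requests it has already seen). Note that a CrawlSpider with a Rule, as in the question, already does this link-following automatically:

import scrapy

class ManualSpider(scrapy.Spider):
    name = 'manual'
    allowed_domains = ['pydro.com']
    start_urls = ['https://pydro.com/']

    def parse(self, response):
        # ... save response.body here, as in parse_item above ...
        # Extract every link and issue a second request for it; each
        # response comes back to this same callback, giving recursion.
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)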


Well, I thought that when I use a CrawlSpider it already crawls recursively - that's why I added the rules, I guess? I tweaked the code slightly: page = response.url.split('.com/')[-1] and filename = 'pydro-%s.html' % page. Looking at the URLs, since each of them contains .com/, I split after .com/ and use what follows as the filename. Now I have all the HTML files on my hard drive. But when I try to apply this to another website, it doesn't work....
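
That '.com/' split is the likely culprit when moving to another site: for a .de domain the separator never matches, so split() returns the whole URL unchanged (and the otego script above uses response.url directly anyway), which puts slashes into the filename - exactly the FileNotFoundError shown in Edit 1. A quick illustration:

>>> 'https://pydro.com/impressum'.split('.com/')[-1]
'impressum'
>>> 'https://www.otego.de/de/jobs.php'.split('.com/')[-1]
'https://www.otego.de/de/jobs.php'

A domain-agnostic scheme, like the urlparse sketch above, avoids this.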