How to recursively scrape every link from a site using Scrapy?

python web-scraping scrapy


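I am trying to use Scrapy to collect every link from a website (and no other data). I want to start at the homepage, scrape all the links there, then follow each link found, scrape all the (unique) links from that page, and repeat this for every link found until there are no more links to follow.

I also have to enter a username and password to access each page of the site, so I have included a basic authentication component in my `start_requests`.

So far I have a spider that only gives me the links on the homepage, but I can't work out why it isn't following the links and scraping the other pages.

Here is my spider: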

    from examplesite.items import ExamplesiteItem
    import scrapy
    from scrapy.linkextractor import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from scrapy import Request
    from w3lib.http import basic_auth_header
    from scrapy.crawler import CrawlerProcess

    class ExampleSpider(CrawlSpider):
        # name of crawler
        name = "examplesite"

        # only scrape on pages within the example.co.uk domain
        allowed_domains = ["example.co.uk"]

        # start scraping on the site homepage once credentials have been authenticated
        def start_requests(self):
            url = str("https://example.co.uk")
            username = "*********"
            password = "*********"
            auth = basic_auth_header(username, password)
            yield scrapy.Request(url=url, headers={'Authorization': auth})

        # rules for recursively scraping the URLs found
        rules = [
            Rule(
                LinkExtractor(
                    canonicalize=True,
                    unique=True
                ),
                follow=True,
                callback="parse"
            )
        ]

        # method to identify hyperlinks by xpath and extract hyperlinks as scrapy items
        def parse(self, response):
            for element in response.xpath('//a'):
                item = ExamplesiteItem()
                oglink = element.xpath('@href').extract()
                # need to add on prefix as some hrefs are not full https URLs and thus cannot be followed for scraping
                if "http" not in str(oglink):
                    item['link'] = "https://example.co.uk" + oglink[0]
                else:
                    item['link'] = oglink

                yield item

Here is my item class:

    from scrapy import Field, Item

    class ExamplesiteItem(Item):
        link = Field()

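I think where I'm going wrong is the rules. I know you need them in order to follow the links, but I don't fully understand how they work (I have tried reading some explanations online but am still unsure).

Any help would be much appreciated.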

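Your rules are fine; the problem is that you are overriding the `parse` method.

From the Scrapy documentation:

When writing crawl spider rules, avoid using `parse` as callback, since the `CrawlSpider` uses the `parse` method itself to implement its logic. So if you override the `parse` method, the crawl spider will no longer work.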

Does the above code work?