Python: Getting all instances of a 404 error with Scrapy


python hyperlink scrapy


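I have Scrapy crawling my website, finding links that return a 404 response and writing them to a JSON file. This works really well.

However, I can't figure out how to get every instance of a given bad link, because the duplicate filter catches these links and does not request them again.

Since our site has thousands of pages and its sections are managed by several teams, I need to be able to produce a bad-link report per section, rather than finding one instance and doing a search-and-replace across the whole site.

Any help would be much appreciated.

My current crawler: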

from datetime import datetime

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

# Add Items for exporting to JSON
class DevelopersLinkItem(Item):
    url = Field()
    referer = Field()
    link_text = Field()
    status = Field()
    time = Field()

class DevelopersSpider(CrawlSpider):
    """Subclasses CrawlSpider to crawl the given site and export each broken link to JSON"""

    # Spider name to be used when calling from the terminal
    name = "developers_prod"

    # Allow only the given host name(s)
    allowed_domains = ["example.com"]

    # Start crawling from this URL
    start_urls = ["https://example.com"]

    # Which status should be reported
    handle_httpstatus_list = [404]

    # Rules on how to extract links from the DOM, which URLs to deny, and which callback to use
    rules = (Rule(LxmlLinkExtractor(deny=([
        '/android/'])), callback='parse_item', follow=True),)

    # Called for each requested page and used for parsing the response
    def parse_item(self, response):
        if response.status == 404:
            # Header values are bytes in Python 3; decode for JSON export
            referer = response.request.headers.get('Referer')

            item = DevelopersLinkItem()
            item['url'] = response.url
            item['referer'] = referer.decode() if referer else None
            item['link_text'] = response.meta.get('link_text')
            item['status'] = response.status
            # The original used self.now, which is never defined on the spider
            item['time'] = datetime.now().strftime("%Y-%m-%d %H:%M")

            return item

I tried a few custom duplicate filters, but in the end none of them worked.

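If I understand your question correctly, CrawlSpider filters out duplicate requests by default. You can use the process_request parameter of the Rule class to set dont_filter=True on each request.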
