Python Scrapy: Rule with ignored extensions doesn't work

python web-scraping scrapy


Scrapy still follows and saves links whose extensions are listed in the Scrapy framework's IGNORED_EXTENSIONS. Why?

On the console, links beyond the depth limit of 2 are ignored:

    2020-12-07 15:39:46 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://www.th-koeln.de/weiterbildung/mit-medienkritik-gegen-fake-news-das-fakehunter-planspiel-als-bibliotheksangebot-fuer-jugendliche_78973.php

But PDF files, PNG images, and the like are still crawled instead of being ignored:

    2020-12-07 15:39:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.th-koeln.de/mam/downloads/deutsch/hochschule/profil/nachhaltige_hochschule/umwelterklarung2019_web.pdf>

This is the spider:

    import scrapy
    from scrapy.linkextractors import IGNORED_EXTENSIONS, LinkExtractor
    from scrapy.spiders import Rule


    class FullThSpider(scrapy.Spider):
        name = 'full_th_spyder'
        # allowed_domains = ['www.th-koeln.de']
        start_urls = ['https://www.th-koeln.de/']
        custom_settings = {
            'DEPTH_LIMIT': 2
        }

        rules = (Rule(LinkExtractor(deny_extensions=(IGNORED_EXTENSIONS)),
                      follow=True), )

        def parse(self, response):
            # Follow every on-site link found on the page
            for link in response.css('a::attr(href)').extract():
                url = response.urljoin(link)
                if url.startswith('https://www.th-koeln.de'):
                    yield response.follow(url, self.parse)

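Your parse method iterates over the URLs in the response and follows them regardless of their extension. The rules attribute is only processed by CrawlSpider; a plain scrapy.Spider never reads it, so the deny_extensions filter of the LinkExtractor is never applied.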
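One possible way to act on this, sketched under assumptions, is to check each URL's extension inside parse before following it. The helper names has_skipped_extension and links_to_follow and the short SKIPPED_EXTENSIONS set are hypothetical; the set merely stands in for Scrapy's own scrapy.linkextractors.IGNORED_EXTENSIONS list. The sketch uses only the standard library, so it can be tried without running a crawl.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical, shortened stand-in for scrapy.linkextractors.IGNORED_EXTENSIONS
# (extensions are written without the leading dot).
SKIPPED_EXTENSIONS = {'pdf', 'png', 'jpg', 'jpeg', 'gif', 'zip'}


def has_skipped_extension(url, extensions=SKIPPED_EXTENSIONS):
    """Return True if the URL's path ends in one of the given extensions."""
    path = urlparse(url).path                 # drop query string and fragment
    ext = splitext(path)[1].lower().lstrip('.')
    return ext in extensions


def links_to_follow(urls, prefix='https://www.th-koeln.de'):
    """Keep only URLs under the prefix whose extension is not skipped."""
    return [u for u in urls
            if u.startswith(prefix) and not has_skipped_extension(u)]
```

In the spider, the condition in parse could then become `if url.startswith('https://www.th-koeln.de') and not has_skipped_extension(url, IGNORED_EXTENSIONS):`, assuming IGNORED_EXTENSIONS is imported from scrapy.linkextractors. An alternative that keeps the rules attribute would be to subclass CrawlSpider and leave parse to the base class instead of overriding it.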