Scrapy: crawl an entire website except links under a specific path

I have a Scrapy spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = "spidermaster"
    allowed_domains = ["www.test.com"]
    start_urls = ["http://www.test.com/"]
    rules = [Rule(SgmlLinkExtractor(allow=()),
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
What I'm trying to do is crawl the whole website, except for whatever is under a specific path.

For example, I want to crawl all of the test site except everything under www.test.com/too_much_links.

Thanks in advance

I usually do it like this:

# deny: regexes a URL must NOT match to be extracted;
# plain strings behave as substring matches against the absolute URL
ignore = ['too_much_links', 'many_links']

rules = [Rule(SgmlLinkExtractor(allow=(), deny=ignore), follow=True),
         Rule(SgmlLinkExtractor(allow=(), deny=ignore), callback='parse_item'),
]
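
A side note on the two-rule layout: in a CrawlSpider, when several rules match the same link only the first one is applied, so with the code above every link is claimed by the follow-only rule and parse_item is never reached. Also, the scrapy.contrib package and SgmlLinkExtractor were removed from Scrapy long ago. The following is a minimal sketch of the same deny idea against the modern API, collapsed into a single rule; the domain and the too_much_links / many_links paths are just the placeholders from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "spidermaster"
    allowed_domains = ["www.test.com"]
    start_urls = ["http://www.test.com/"]

    # Skip anything under these paths; every other link is followed
    # and its response handed to parse_item.
    deny_paths = [r'/too_much_links', r'/many_links']
    rules = [
        Rule(LinkExtractor(deny=deny_paths), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Because denied patterns are filtered out at link-extraction time, pages under www.test.com/too_much_links are never even requested.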

What have you tried so far? Give some examples of what you attempted, or, if you've misunderstood something while trying to get the spider to do what you want, explain where you think the misunderstanding comes from.