Scrapy: crawl an entire website except links under a specific path
I have a Scrapy spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = "spidermaster"
    allowed_domains = ["www.test.com"]
    start_urls = ["http://www.test.com/"]

    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
What I'm trying to do is crawl the whole website, except for the content under a specific path.

For example, I want to crawl all of the test site except www.test.com/too_much_links.

Thanks in advance.

I usually do it like this:
ignore = ['too_much_links', 'many_links']

rules = [
    Rule(SgmlLinkExtractor(allow=(), deny=ignore), follow=True),
    Rule(SgmlLinkExtractor(allow=(), deny=ignore), callback='parse_item'),
]
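For intuition on what `deny` does: Scrapy compiles each entry in the `deny` list as a regular expression and drops any extracted link whose URL matches one of them. A minimal standard-library sketch of that filtering logic (the function name `is_denied` and the sample URLs are illustrative, not part of Scrapy's API):

```python
import re

# The same deny list passed to the link extractor above.
ignore = ['too_much_links', 'many_links']

# Scrapy treats each entry as a regex matched against the full URL.
deny_res = [re.compile(p) for p in ignore]

def is_denied(url):
    """Return True if the URL matches any deny pattern."""
    return any(r.search(url) for r in deny_res)

print(is_denied('http://www.test.com/too_much_links/page1'))  # True: filtered out
print(is_denied('http://www.test.com/articles/page1'))        # False: still crawled
```

Note that the patterns are unanchored regexes, so `'too_much_links'` matches anywhere in the URL; anchor them (e.g. `r'/too_much_links/'`) if you need stricter matching.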
What have you tried so far? Give some examples of what you attempted, or, if something went wrong while getting the spider to do what you want, explain where you think the misunderstanding lies.