Scrapy not firing Request callback
I'm playing around with scraping info from Amazon, but it's giving me a hard time. So far my spider looks like this:
    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import SgmlLinkExtractor
    from scrapy.selector import Selector
    # AmzItem is the project's Item class (its import is omitted in the question)

    class AmzCrawlerSpider(CrawlSpider):
        name = 'amz_crawler'
        allowed_domains = ['amazon.com']
        start_urls = ['http://www.amazon.com/s?ie=UTF8&bbn=1267449011&page=1&rh=n%3A284507%2Cn%3A1055398%2Cn%3A%211063498%2Cn%3A1267449011%2Cn%3A3204211011']

        rules = (
            Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            category_name = Selector(response).xpath('//*[@id="nav-subnav"]/a[1]/text()').extract()[0]
            products = Selector(response).xpath('//div[@class="s-item-container"]')
            for product in products:
                item = AmzItem()
                item['title'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@title').extract()[0]
                item['url'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@href').extract()[0]
                request = scrapy.Request(item['url'], callback=self.parse_product)
                request.meta['item'] = item
                print "Crawl " + item["title"]
                print "Crawl " + item['url']
                yield request

        def parse_product(self, response):
            print("Parse Product")
            item = response.meta['item']
            sel = Selector(response)
            item['asin'] = sel.xpath('//td[@class="bucket"]/div/ul/li/b[contains(text(),"ASIN:")]/../text()').extract()[0]
            return item
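As an aside on the callbacks above: every `.extract()[0]` raises an IndexError when the XPath matches nothing, which silently aborts the rest of the callback for that page. A minimal, Scrapy-free sketch of a guarded lookup (the `first` helper is hypothetical, not part of the question's code; newer Scrapy versions ship `extract_first()` for the same purpose):

```python
def first(values, default=None):
    # .extract() returns a list of matched strings; guard the empty
    # case instead of indexing [0] blindly.
    return values[0] if values else default

print(first(["Pet Supplies"]))   # first match wins
print(first([], "unknown"))      # no match: default instead of IndexError
```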
There are two issues I can't seem to figure out:

1. "Parse Product" never gets printed - so I assume the parse_product method is never executed, even though the "Crawl ..." lines show up just fine. Maybe it's something about the rules?

2. Also related to the rules: the spider only works on the first page of a category. The crawler doesn't follow the link to the second page of a category.
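One way to narrow down the rules question is to test the Rule's `allow` pattern directly against the kinds of URLs that appear on the page, since the link extractor applies `allow` as a regex search on each extracted URL. A quick sketch (the candidate URLs are illustrative, shortened from the logs below, not taken from a live page):

```python
import re

allow = r"page=\d+"  # the pattern from the spider's Rule

candidates = [
    # category-browse style URL with an explicit page parameter
    "http://www.amazon.com/s?ie=UTF8&bbn=1267449011&page=2&rh=n%3A284507",
    # search-pagination style URL: page number is in the path, not a parameter
    "http://www.amazon.com/gp/search/ref=sr_pg_2?rh=n%3A2619533011",
]
for url in candidates:
    print(url, "->", bool(re.search(allow, url)))
```

If the "next page" links on the category pages look like the second form, the pattern never matches them and the rule has nothing to follow.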
I assume the pages are generated for Scrapy in a different way than for a browser? In the console I see a lot of 301 redirects:
    2015-06-30 14:57:24+0800 [amz_crawler] DEBUG: Redirecting (301) to <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> from <http://www.amazon.com/gp/search/ref=sr_pg_1?sf=qz&fst=as%3Aoff&rh=n%3A2619533011%2Ck%3Apet+supplies%2Cp_n_date_first_available_absolute%3A2661609011%2Cp_72%3A2661618011&sort=date-desc-rank&keywords=pet+supplies&ie=UTF8&qid=1435312739>
    2015-06-30 14:57:29+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> (referer: None)
    2015-06-30 14:57:39+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Cp_72%3A2661618011> (referer: )
    Crawl Precious Cat Ultra Premium Clumping Cat Litter, 40 lb bag
    Crawl
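A classic reason a per-item Request silently never reaches its callback is that the extracted @href is relative rather than absolute, since scrapy.Request expects an absolute URL. Whether that applies here depends on Amazon's markup, but the fix is a plain urljoin against the page URL. A sketch with a hypothetical relative href (Python 3 syntax shown; on the Python 2.7 in the question it is urlparse.urljoin, and in the spider it would be urljoin(response.url, href)):

```python
from urllib.parse import urljoin

page_url = "http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011"
href = "/Precious-Cat-Ultra-Premium/dp/B0009X29WK"  # hypothetical relative href

# urljoin resolves the href against the page it was found on,
# producing an absolute URL that scrapy.Request will accept.
print(urljoin(page_url, href))
# -> http://www.amazon.com/Precious-Cat-Ultra-Premium/dp/B0009X29WK
```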
What am I doing wrong?

Comments:

- Just tried to run it (I simply replaced the print with inspect_response, which pauses the run and lets you inspect the response, like scrapy shell), and it works.
- That's strange - I just copied your code and tried it, but inspect_response(response, self) / print("Parse Product") is never executed, even after several minutes. The print "Crawl " + ... is executed frequently. I have Scrapy 0.24.6 and Python 2.7.6 - what about you?
- I'm running the same versions, although that shouldn't matter. Maybe some project setting is modifying your results? Try scrapy startproject and recreate the spider in a new project.
- Yes.. it seems to be working with a fresh project crawler - though not exactly as I expected (wrong rules, I guess) - I'll keep digging and see if I can find the root of the problem. Thanks