
Scrapy Request callback not fired


I'm playing around with scraping information from Amazon, but it's giving me a hard time. My spider looks like this so far:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

# AmzItem is the Item subclass defined in the project's items.py

class AmzCrawlerSpider(CrawlSpider):
    name = 'amz_crawler'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s?ie=UTF8&bbn=1267449011&page=1&rh=n%3A284507%2Cn%3A1055398%2Cn%3A%211063498%2Cn%3A1267449011%2Cn%3A3204211011']

    rules = (Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        category_name = Selector(response).xpath('//*[@id="nav-subnav"]/a[1]/text()').extract()[0]
        products = Selector(response).xpath('//div[@class="s-item-container"]')

        for product in products:
            item = AmzItem()
            item['title'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@title').extract()[0]
            item['url'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@href').extract()[0]
            request = scrapy.Request(item['url'], callback=self.parse_product)
            request.meta['item'] = item
            print "Crawl " + item["title"]
            print "Crawl " + item['url']
            yield request

    def parse_product(self, response):
        print("Parse Product")
        item = response.meta['item']
        sel = Selector(response)
        item['asin'] = sel.xpath('//td[@class="bucket"]/div/ul/li/b[contains(text(),"ASIN:")]/../text()').extract()[0]

        return item
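The spider above hands the partially-filled item from parse_item to parse_product through request.meta. The hand-off pattern can be sketched without Scrapy at all; here Request is a hypothetical stand-in class (not scrapy.Request), and the URL, title, and ASIN values are placeholders for illustration:

```python
# Sketch of the meta hand-off pattern used in the spider above.
# Request here is a minimal stand-in, NOT scrapy.Request.

class Request(object):
    """Hypothetical stand-in: carries a URL, a callback and a meta dict."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback
        self.meta = {}

def parse_item(url):
    # Build a partial item, attach it to the request, yield the request.
    item = {'title': 'Some product', 'url': url}
    request = Request(item['url'], callback=parse_product)
    request.meta['item'] = item          # attach the partial item
    yield request

def parse_product(request):
    # Recover the item in the follow-up callback and finish filling it.
    item = request.meta['item']
    item['asin'] = 'B00EXAMPLE'          # placeholder for the scraped value
    return item

# Simulate the engine: take the request parse_item yields, fire its callback.
req = next(parse_item('http://www.amazon.com/dp/B00EXAMPLE'))
item = req.callback(req)
print(item['url'])   # the meta dict carried the item across callbacks
```

The point of the pattern is that the follow-up callback only runs when the engine actually schedules and downloads the yielded request, which is exactly the step that seems to be failing here.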
There are two things I can't seem to figure out:

"Parse Product" is never printed - so I assume the parse_product method is never executed, even though the "Crawl ..." lines print just fine. Maybe it's something about the rules?

Then, related to the rules: it only works for the first page of a category. The crawler doesn't follow the links to the second page of a category. I assume the pages are generated for Scrapy in a different way than for a browser? In the console I see a lot of 301 redirects:

2015-06-30 14:57:24+0800 [amz_crawler] DEBUG: Redirecting (301) to <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> from <http://www.amazon.com/gp/search/ref=sr_pg_1?sf=qz&fst=as%3Aoff&rh=n%3A2619533011%2Ck%3Apet+supplies%2Cp_n_date_first_available_absolute%3A2661609011%2Cp_72%3A2661618011&sort=date-desc-rank&keywords=pet+supplies&ie=UTF8&qid=1435312739>
2015-06-30 14:57:29+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> (referer: None)
2015-06-30 14:57:39+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Cp_72%3A2661618011> (referer: )
Crawl Precious Cat Ultra Premium Clumping Cat Litter, 40 Pound Bag
Crawl
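One thing worth ruling out is whether the URLs that come back from the 301 redirects still match the Rule's allow pattern. A quick check of the pattern from the spider against the redirected URL from the log above (the URL is copied from the log, the rest is a plain-regex sketch):

```python
import re

# The allow pattern from the spider's Rule
allow = r"page=\d+"

# The URL the 301 redirected to, taken from the log above
redirected = ("http://www.amazon.com/s?ie=UTF8&page=1"
              "&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011"
              "%2Cp_n_date_first_available_absolute%3A2661609011")

# The pattern itself does match the redirected URL
print(bool(re.search(allow, redirected)))
```

Note that SgmlLinkExtractor applies this pattern to the hrefs of links found in the downloaded page, not to the response URL itself, so a matching pattern alone doesn't guarantee the pagination links on the page are extracted and followed.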


What am I doing wrong?

I just tried to run it (I just replaced the prints with inspect_response, which pauses the crawl and lets you inspect the response, like scrapy shell), and it works. My code:

That's strange - I just copied your code and tried it, but inspect_response(response, self) / print("Parse Product") is never executed, even after several minutes. print "Crawl " + ... is executed regularly. I have Scrapy 0.24.6 and Python 2.7.6 - what about you?

I'm running the same versions too, although that shouldn't matter. Maybe some project setting is modifying your results? Try scrapy startproject and recreate the spider inside the new project.

Yes.. the crawler seems to work in a new project - though not exactly as I expected (wrong rules, I guess) - I'll keep digging and see if I can find the root of the problem. Thanks