Python: Scrapy won't crawl even though the request has been queued?

I currently have the following rules:

# Matches all comments page under user overview,
# http://lookbook.nu/user/50784-Adam-G/comments/
Rule(SgmlLinkExtractor(allow=('/user/\d+[^/]+/comments/?$'), deny=('\?locale=')),
    callback='parse_model_comments'),
# http://lookbook.nu/user/50784-Adam-G/comments?page=2
Rule(SgmlLinkExtractor(allow=('/user/\d+[^/]+/comments\?page=\d+$'), deny=('\?locale=')),
    callback='parse_model_comments'),
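
For reference, these rules sit in a CrawlSpider subclass along the following lines (a minimal sketch; allowed_domains and the exact class layout are assumptions, while the spider name and start URL are taken from the log below):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class LookbookSpider(CrawlSpider):
  name = 'lookbook'
  allowed_domains = ['lookbook.nu']  # assumption
  start_urls = ['http://lookbook.nu/user/1363501-Rachael-Jane-H/comments']

  rules = (
    Rule(SgmlLinkExtractor(allow=('/user/\d+[^/]+/comments/?$'), deny=('\?locale=')),
         callback='parse_model_comments'),
    Rule(SgmlLinkExtractor(allow=('/user/\d+[^/]+/comments\?page=\d+$'), deny=('\?locale=')),
         callback='parse_model_comments'),
  )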
And in my function definition:

def parse_model_comments(self, response):
  log.msg("Inside parse_model_comments")
  hxs = HtmlXPathSelector(response)
  model_url = hxs.select('//div[@id="userheader"]/h1/a/@href').extract()[0]
  comments_hxs = hxs.select(
      '//div[@id="profile_comments"]/div[@id="comments"]/div[@class="comment"]')
  if comments_hxs:
    log.msg("Yielding next page." + LookbookSpider.next_page(response.url))
    yield Request(LookbookSpider.next_page(response.url))
Here is the actual run log:

2012-11-26 18:52:46-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments> (referer: None)
2012-11-26 18:52:46-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments> (referer: http://lookbook.nu/user/1363501-Rachael-Jane-H/comments)
2012-11-26 18:52:46-0800 [scrapy] INFO: Inside parse_model_comments
2012-11-26 18:52:46-0800 [scrapy] INFO: Yielding next page.http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2
2012-11-26 18:52:46-0800 [lookbook] DEBUG: Scraped from <200 http://lookbook.nu/user/1363501-Rachael-Jane-H/comments>
    {'model_url': u'http://lookbook.nu/rachinald',
     'posted_at': u'2012-11-26T13:21:49-05:00',
     'target_url': u'http://lookbook.nu/look/4290423-Blackout-Challenge-One',
     'text': u"Thanks Justina :) They're actually purple - the whole premise is to not wear black all week ^^",
     'type': 2}
...
2012-11-26 18:52:47-0800 [lookbook] DEBUG: Crawled (200) <GET http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2> (referer: http://lookbook.nu/user/1363501-Rachael-Jane-H/comments)
2012-11-26 18:52:48-0800 [lookbook] INFO: Closing spider (finished)
2012-11-26 18:52:48-0800 [lookbook] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2072,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 51499,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 11, 27, 2, 52, 48, 43058),
     'item_scraped_count': 14,
     'log_count/DEBUG': 23,
     'log_count/INFO': 6,
     'request_depth_max': 3,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'start_time': datetime.datetime(2012, 11, 27, 2, 52, 44, 446851)}
2012-11-26 18:52:48-0800 [lookbook] INFO: Spider closed (finished)
Even though ?page=2 was crawled, parse_model_comments was not called for it, since "Inside parse_model_comments" was never logged for that request.

I checked

re.search(r'/user/\d+[^/]+/comments\?page=\d+$', 'http://lookbook.nu/user/1363501-Rachael-Jane-H/comments?page=2')

and confirmed that it does match.


Any idea why ?page=2 was crawled but the function was never called?

I know this may look weird, but perhaps the callback shouldn't be a generator (i.e. something that yields)?

I'd suggest:

def parse_model_comments(self, response):
    return list(self._iter_parse_model_comments(response))

def _iter_parse_model_comments(self, response):
    # place your current code here
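
The idea behind this wrapper is that list() consumes the generator eagerly, so the callback hands Scrapy a plain list of items and requests rather than a generator object, which helps rule out any generator-related issues.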

It turns out that the callback has to be specified manually in a CrawlSpider when yielding a Request.

If the Request object doesn't have a callback set, the default parse() is called.

CrawlSpider's parse() function just returns []; check the source code.
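
In other words, the fix is to pass the callback explicitly when yielding the follow-up request. A minimal sketch of the change inside parse_model_comments (next_page is the same helper used in the question):

if comments_hxs:
  log.msg("Yielding next page." + LookbookSpider.next_page(response.url))
  # Without an explicit callback, this Request falls back to CrawlSpider's
  # default parse(), so parse_model_comments never runs for ?page=2.
  yield Request(LookbookSpider.next_page(response.url),
                callback=self.parse_model_comments)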