Scrapy: connecting items from different requests into a single yield

I am scraping a news site. Each news article has its content plus many comments, so I have two items: one for the content and one for the comments. The problem is that the content and the comments are produced by different requests, so they are yielded as separate, unrelated items. I want an article's content and its comments to be yielded or returned together as one item; the timing or order in the pipeline does not matter to me.

In the items file:
import scrapy

class NewsPageItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    hour = scrapy.Field()
    image = scrapy.Field()
    image_url = scrapy.Field()
    top_content = scrapy.Field()
    parag = scrapy.Field()
    # comments = scrapy.Field()
    comments_count = scrapy.Field()

class CommentsItem(scrapy.Item):
    id_ = scrapy.Field()
    username = scrapy.Field()
    firstname = scrapy.Field()
    lastname = scrapy.Field()
    email = scrapy.Field()
    ip = scrapy.Field()
    userid = scrapy.Field()
    date = scrapy.Field()
    comment_text = scrapy.Field()
    comment_type_id = scrapy.Field()
    object_id = scrapy.Field()
    yes = scrapy.Field()
    no = scrapy.Field()
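For the two item types to travel together, the news item needs a field that can hold a list of comment items; a Scrapy `Field` accepts any value, including a list of sub-items. A minimal sketch of the target shape, using plain dicts in place of `scrapy.Item` so it runs without Scrapy installed (field names mirror `NewsPageItem` / `CommentsItem` above):

```python
# One news record carrying its comments as a nested list.
news = {
    "title": "Example headline",
    "comments_count": 2,
    "comments": [],  # corresponds to uncommenting `comments = scrapy.Field()`
}

for cid in (1, 2):
    comment = {"id_": cid, "comment_text": f"comment {cid}"}
    news["comments"].append(comment)

print(len(news["comments"]))  # 2
```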
In the spider, the news content and its many comments are not linked:
class NewsSpider(scrapy.Spider):
    ...

    def parse(self, response):
        for nl in news_links:
            yield scrapy.Request(url=nl, callback=self.new_parse)
            yield scrapy.Request(url=url, callback=self.comment_parse)

    def new_parse(self, response):
        item = BigParaItem()
        item['title'] = response.xpath(...).extract()
        ...
        yield item

    def comment_parse(self, response):
        data = json.loads(response.body.decode('utf8'))
        for comment in data.get('data', []):
            item = CommentsItem()
            item['id_'] = comment.get('Id')
            ...
            yield item
The pipelines:
class NewsPagePipeline(object):
    def process_item(self, item, spider):
        return item

class CommentsPipeline(object):
    def process_item(self, item, spider):
        return item
How can I link the items, or nest one inside the other when yielding?

The cleanest approach is to chain the requests between callbacks and pass the news item along via `meta`, so the comment callback can fill it with the comments:
class NewsSpider(scrapy.Spider):
    ...

    def parse(self, response):
        for nl in news_links:
            # Carry the comments URL along so new_parse can chain to it.
            yield scrapy.Request(url=nl, callback=self.new_parse, meta={'comments_url': url})

    def new_parse(self, response):
        item = BigParaItem()
        item['title'] = response.xpath(...).extract()
        item['comments'] = []  # requires a `comments = scrapy.Field()` on the item
        ...
        # Instead of yielding the item here, chain to the comments request,
        # handing the partially filled item over in meta.
        yield scrapy.Request(response.meta['comments_url'], callback=self.comment_parse, meta={'item': item})

    def comment_parse(self, response):
        data = json.loads(response.body.decode('utf8'))
        item = response.meta['item']
        for comment in data.get('data', []):
            c_item = CommentsItem()
            c_item['id_'] = comment.get('Id')
            ...
            item['comments'].append(c_item)
        # One combined item: the news content plus all of its comments.
        yield item
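The chaining above can be simulated without Scrapy to show that exactly one combined item comes out per news page. The driver loop below plays the role of the Scrapy engine; `FakeRequest` and the dict items are hypothetical stand-ins for illustration, not Scrapy APIs:

```python
# Minimal stand-in for scrapy.Request so the callback chain runs as-is.
class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}

def parse(response):
    yield FakeRequest("/news/1", new_parse, meta={"comments_url": "/api/comments?news=1"})

def new_parse(response):
    item = {"title": "headline", "comments": []}
    yield FakeRequest(response.meta["comments_url"], comment_parse, meta={"item": item})

def comment_parse(response):
    item = response.meta["item"]
    for cid in (1, 2):  # stand-in for the JSON comment data
        item["comments"].append({"id_": cid})
    yield item

# Engine stand-in: follow requests until a plain item is yielded.
queue = [FakeRequest("/", parse)]
items = []
while queue:
    req = queue.pop()
    for result in req.callback(req):  # the request itself stands in for the response
        if isinstance(result, FakeRequest):
            queue.append(result)
        else:
            items.append(result)

print(len(items), len(items[0]["comments"]))  # 1 2
```

In a real spider the same hand-off happens through `response.meta`; since Scrapy 1.7 the `cb_kwargs` argument of `Request` is the recommended way to pass data between callbacks.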