Python: How to avoid unwanted fields in the response (Scrapy)


python scrapy web-crawler


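Hi everyone, and thanks in advance:

When I run Scrapy, I export the items to a .json file, but instead of only the items I want, I also get some junk:

I know this unwanted data comes along with the response (line 26), but I would like to know how to keep it from ending up in my JSON.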

Please use a more explicit title to help others who may have the same problem; "junk" is a very vague word.

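You can find more information about the meta attribute in the Scrapy documentation:

"A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled."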

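If you want to keep Scrapy from filling your JSON with all this information, store the item under its own key in meta (here 'detail') and yield only that dict instead of the whole response.meta: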

import scrapy

# Both methods belong inside your scrapy.Spider subclass.
def parse(self, response):
  for tfg in response.css('li.row-fluid'):
    # Collect the fields that are available on the listing page.
    doc = {}
    doc['titulo'] = tfg.css('h2 a::text').extract_first()
    doc['url'] = response.urljoin(tfg.css('h2 a::attr(href)').extract_first())

    # Pass the partial item under its own key so it stays separate
    # from the keys that Scrapy components add to meta.
    request = scrapy.Request(doc['url'], callback=self.parse_detail)
    request.meta['detail'] = doc
    yield request

  # Follow pagination; next_href avoids shadowing the built-in next().
  next_href = response.css('a.next::attr(href)').extract_first()
  if next_href is not None:
    next_page = response.urljoin(next_href)
    yield scrapy.Request(next_page, callback=self.parse)

def parse_detail(self, response):
  # Take only the item dict out of meta, complete it, and yield it.
  detail = response.meta['detail']
  detail['page_count'] = ' '.join(response.css('dd.page.count::text').extract())
  detail['keywords'] = ' '.join(response.css('div.descripcio a::text').extract())
  yield detail
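
To see why the separate key matters, here is a small sketch that uses plain dicts and does not need Scrapy installed. It assumes the unwanted fields came from writing item fields directly into meta and yielding the whole dict; the question's original code is not shown, so that is a guess. The helper simulate_response_meta and the sample values are hypothetical; the key names (depth, download_timeout, download_slot, download_latency) are ones Scrapy components commonly set.

```python
import json

# Hypothetical stand-in for what Scrapy components may add to Request.meta.
# The key names are real Scrapy meta keys; the values are made up.
SCRAPY_ADDED_KEYS = {
    "depth": 1,
    "download_timeout": 180.0,
    "download_slot": "example.com",
    "download_latency": 0.25,
}


def simulate_response_meta(request_meta):
    """Return a copy of request_meta with middleware-style keys merged in."""
    meta = dict(request_meta)
    meta.update(SCRAPY_ADDED_KEYS)
    return meta


item = {"titulo": "Example title", "url": "http://example.com/doc/1"}

# Leaky pattern (assumed): item fields sit directly in meta,
# and the whole meta dict is emitted as the item.
leaky = simulate_response_meta(dict(item))
leaky["page_count"] = "120"

# Pattern from the answer: the item lives under its own 'detail' key,
# so only the item's own fields are emitted.
clean = simulate_response_meta({"detail": item})["detail"]
clean["page_count"] = "120"

print(json.dumps(leaky, sort_keys=True))
print(json.dumps(clean, sort_keys=True))
```

The first printed line mixes the item with the download-related keys, while the second contains only titulo, url and page_count.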
