Python: creating an array of items in Scrapy with multiple parse callbacks

python arrays scrapy

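I'm scraping listings with Scrapy. My script first parses the listing URLs with parse_node, then parses each listing with parse_listing, and for each listing it parses that listing's agents with parse_agent. I want to create an array that is built up as Scrapy parses the listing and the listing's agents, and that is reset for each new listing.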

Here is my parsing script:

 def parse_node(self,response,node):
  yield Request('LISTING LINK',callback=self.parse_listing)
 def parse_listing(self,response):
  yield response.xpath('//node[@id="ListingId"]/text()').extract_first()
  yield response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
  for agent in string.split(response.xpath('//node[@id="Agents"]/text()').extract_first() or "",'^'):
   yield Request('AGENT LINK',callback=self.parse_agent)
 def parse_agent(self,response):
  yield response.xpath('//node[@id="AgentName"]/text()').extract_first()
  yield response.xpath('//node[@id="AgentEmail"]/text()').extract_first()

I'd like parsing the listing to produce the following result:

{
 'id':123,
 'title':'Amazing Listing'
}

Then parse the listing's agents and add them to the listing's array:

{
 'id':123,
 'title':'Amazing Listing',
 'agent':[
  {
   'name':'jon doe',
   'email':'jon.doe@email.com'
  },
  {
   'name':'jane doe',
   'email':'jane.doe@email.com'
  }
 ]
}

How do I collect the results from each level and build a single array?

Create a dict (hash) for the listing that holds a list of agents, and append the data from each request to that list:

# uses the requests and lxml packages; these fetches are blocking
import requests
from lxml import html

listing = {"title": "amazing listing", "agents": []}

agentUrls = ["list", "of", "urls", "from", "scraped", "page"]

for agentUrl in agentUrls:
    agentPage = requests.get(agentUrl)
    agentTree = html.fromstring(agentPage.content)
    # lxml's xpath() returns a plain list, so take the first match or None
    name = next(iter(agentTree.xpath('//node[@id="AgentName"]/text()')), None)
    email = next(iter(agentTree.xpath('//node[@id="AgentEmail"]/text()')), None)
    agent = {"name": name, "email": email}
    listing["agents"].append(agent)
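
The shape this loop builds can be checked without any network access. The following is only a minimal sketch: it assumes the agent pages are XML made of node elements as in the question, it swaps in the standard library's ElementTree for the parsing, and the fetch callable, the sample documents and the example.com URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical canned agent pages, keyed by made-up URLs.
FAKE_PAGES = {
    "http://example.com/agent/1": (
        '<root><node id="AgentName">jon doe</node>'
        '<node id="AgentEmail">jon.doe@email.com</node></root>'
    ),
    "http://example.com/agent/2": (
        '<root><node id="AgentName">jane doe</node>'
        '<node id="AgentEmail">jane.doe@email.com</node></root>'
    ),
}


def first_text(tree, node_id):
    """Return the text of the first <node id=...> element, or None if absent."""
    element = tree.find(f".//node[@id='{node_id}']")
    return element.text if element is not None else None


def build_listing(title, agent_urls, fetch):
    """Build one listing dict, appending one agent dict per URL.

    `fetch` is a hypothetical callable that maps a URL to an XML string.
    """
    listing = {"title": title, "agents": []}
    for agent_url in agent_urls:
        tree = ET.fromstring(fetch(agent_url))
        listing["agents"].append({
            "name": first_text(tree, "AgentName"),
            "email": first_text(tree, "AgentEmail"),
        })
    return listing


listing = build_listing("amazing listing", list(FAKE_PAGES), FAKE_PAGES.get)
```

Passing the fetch step in as a parameter keeps the dict-building logic separate from how the pages are downloaded, so the same function could be pointed at a real HTTP client later.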

This is a bit complicated:
you need to assemble a single item from several different URLs.

Scrapy lets you carry data along in a request's meta attribute, so you can do something like this:

from collections import defaultdict
from scrapy import Request

def parse_node(self, response, node):
    yield Request('LISTING LINK', callback=self.parse_listing)

def parse_listing(self, response):
    item = defaultdict(list)
    item['id'] = response.xpath('//node[@id="ListingId"]/text()').extract_first()
    item['title'] = response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    # the agent urls arrive as one '^'-separated string; drop empty entries
    raw_urls = response.xpath('//node[@id="Agents"]/text()').extract_first() or ""
    agent_urls = [url for url in raw_urls.split('^') if url]
    if not agent_urls:  # no agents to crawl, so the item is already complete
        yield item
        return
    # we want to go through agent urls one-by-one and update a single item with agent data
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})

def parse_agent(self, response):
    item = response.meta['item']  # retrieve item generated in previous request
    agent = dict()
    agent['name'] = response.xpath('//node[@id="AgentName"]/text()').extract_first()
    agent['email'] = response.xpath('//node[@id="AgentEmail"]/text()').extract_first()
    item['agents'].append(agent)
    # check if we have any more agent urls left
    agent_urls = response.meta['agent_urls']
    if not agent_urls:  # we crawled all of the agents!
        yield item  # this callback is a generator, so the item must be yielded
        return
    # if we do - crawl next agent and carry over our current item
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})
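
The hand-off through meta can be traced without running a crawl. What follows is a rough sketch, not Scrapy itself: FakeRequest, FakeResponse, the run driver and the AGENTS data are hypothetical stand-ins invented for illustration, and the callbacks are simplified to plain functions that read canned data instead of XPath results.

```python
from collections import defaultdict


class FakeRequest:
    """Hypothetical stand-in for a request: a url, a callback and a meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical stand-in for a response: the request's meta plus canned data."""

    def __init__(self, request, data):
        self.url = request.url
        self.meta = request.meta
        self.data = data


# Made-up agent data keyed by made-up URLs.
AGENTS = {
    "agent/1": {"name": "jon doe", "email": "jon.doe@email.com"},
    "agent/2": {"name": "jane doe", "email": "jane.doe@email.com"},
}


def parse_listing(listing_id, title, agent_urls):
    item = defaultdict(list)
    item["id"] = listing_id
    item["title"] = title
    agent_urls = list(agent_urls)
    if not agent_urls:  # nothing to chain, emit the item right away
        yield item
        return
    yield FakeRequest(agent_urls.pop(0), parse_agent,
                      meta={"item": item, "agent_urls": agent_urls})


def parse_agent(response):
    item = response.meta["item"]  # the same item object travels with each request
    item["agents"].append(response.data)
    agent_urls = response.meta["agent_urls"]
    if not agent_urls:  # last agent reached, emit the finished item
        yield item
        return
    yield FakeRequest(agent_urls.pop(0), parse_agent,
                      meta={"item": item, "agent_urls": agent_urls})


def run(start):
    """Tiny driver: follow requests in order and collect the finished items."""
    items = []
    pending = list(start)
    while pending:
        result = pending.pop(0)
        if isinstance(result, FakeRequest):
            response = FakeResponse(result, AGENTS[result.url])
            pending.extend(result.callback(response))
        else:
            items.append(result)
    return items


items = run(parse_listing(123, "Amazing Listing", ["agent/1", "agent/2"]))
```

In this sketch each agent request is issued only after the previous one has been handled, so the agents of one listing are fetched sequentially and the item is emitted exactly once, by whichever callback finds the URL list empty.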

Thanks for the reply, it works. I haven't found any other way to solve it. I wish Scrapy handled this case itself and offered a built-in behavior for it.