Python Scrapy: how do I get the index of the item being parsed?

python database xpath web-scraping scrapy


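I am trying to use Scrapy with XPath rules that are loaded from a database.

The code I have written so far runs fine, but after some debugging I realized that Scrapy processes the requests asynchronously, which means I cannot control the order in which the responses are parsed.

What I want is to find out, when a response reaches the parse() function, which item in the list it belongs to, so that I can map that index to the row in the database and fetch the correct XPath query. My current approach is to keep a variable called item_index and increment it after each item. I now realize that this is not enough, and I am hoping there is some built-in functionality that can help me with this.

Does anyone know the proper way to keep track of this? I have looked through the documentation but could not find anything about it. I have also looked further, but I can't work out how the list of URLs is stored.

Here is my code to explain the problem further: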

# -*- coding: utf-8 -*-

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Product

from dirbot.database import DatabaseConnection

# Create a database connection object so we can execute queries
connection = DatabaseConnection()

class DmozSpider(Spider):
    name = "dmoz"
    start_urls = []
    item_index = 0

    # Query for all products sold by a merchant
    rows = connection.query("SELECT * FROM products_merchant WHERE 1=1")

    def start_requests(self):
        for row in self.rows:
            yield self.make_requests_from_url(row["product_url"])

    def parse(self, response):
        sel = Selector(response)
        item = Product()
        item['product_id'] = self.rows[self.item_index]['product_id']
        item['merchant_id'] = self.rows[self.item_index]['merchant_id']
        item['price'] = sel.xpath(self.rows[self.item_index]['xpath_rule']).extract()

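        # Not reliable: responses are not guaranteed to arrive in request order,
        # so this counter may not match the row that produced this response.
        self.item_index += 1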

        return item

Any guidance would be greatly appreciated.

Thanks

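You can use the request's meta attribute to pass the index (or the row id from your database) along with the request. It is a dictionary, and you can read it back from your handler through the response.

For example, when you build the request:

    Request(url, callback=self.some_handler, meta={'row_id': row['id']})

Using a counter the way you tried will not work, because you have no guarantee about the order in which the responses are processed.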
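To make the ordering problem concrete, here is a minimal sketch that uses only the standard library and does not touch Scrapy at all. The rows, the URLs, the dict-based "requests" and every function name are made up for illustration. It pretends the responses come back in reverse order, then compares a shared counter with a per-request meta dict.

```python
# Stdlib-only simulation (no Scrapy involved): why a shared counter breaks
# when responses arrive out of order, and why per-request metadata does not.

# Hypothetical rows standing in for the products_merchant table.
ROWS = [
    {"id": 10, "product_url": "http://example.com/a", "xpath_rule": "//a"},
    {"id": 11, "product_url": "http://example.com/b", "xpath_rule": "//b"},
    {"id": 12, "product_url": "http://example.com/c", "xpath_rule": "//c"},
]


def make_requests(rows):
    """Build one fake request per row, tagging each with its own index."""
    return [
        {"url": row["product_url"], "meta": {"index": i}}
        for i, row in enumerate(rows)
    ]


def deliver_out_of_order(requests):
    """Pretend the network answered in reverse order."""
    return list(reversed(requests))


def parse_with_counter(responses, rows):
    """Counter approach: assumes responses arrive in request order."""
    pairs = []
    counter = 0
    for response in responses:
        pairs.append((response["url"], rows[counter]["product_url"]))
        counter += 1
    return pairs


def parse_with_meta(responses, rows):
    """Meta approach: each response carries the index of its own row."""
    return [
        (response["url"], rows[response["meta"]["index"]]["product_url"])
        for response in responses
    ]


responses = deliver_out_of_order(make_requests(ROWS))
counter_pairs = parse_with_counter(responses, ROWS)
meta_pairs = parse_with_meta(responses, ROWS)

# With the counter, some URLs end up paired with the wrong row.
counter_mismatches = [p for p in counter_pairs if p[0] != p[1]]
# With meta, every URL is paired with the row it was requested for.
meta_mismatches = [p for p in meta_pairs if p[0] != p[1]]
print(len(counter_mismatches), len(meta_mismatches))
```

In a real spider the dict with "url" and "meta" would be a Request built with meta={'index': i}, and the handler would read response.meta['index'].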

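This is the solution I came up with, in case anyone needs it.

As @toothrot suggested, you need to override the spider method that builds each Request (make_requests_from_url in the code below) so that the meta information can be attached to the request and read back in parse().

Hope this helps someone else.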

# -*- coding: utf-8 -*-

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from dirbot.items import Product

from dirbot.database import DatabaseConnection

# Create a database connection object so we can execute queries
connection = DatabaseConnection()

class DmozSpider(Spider):
    name = "dmoz"
    start_urls = []

    # Query for all products sold by a merchant
    rows = connection.query("SELECT * FROM products_merchant WHERE 1=1")

    def start_requests(self):
        # Tag every request with the index of the row it was built from
        for indx, row in enumerate(self.rows):
            self.start_urls.append(row["product_url"])
            yield self.make_requests_from_url(row["product_url"], {'index': indx})

    def make_requests_from_url(self, url, meta):
        # Overridden so that the meta dict travels with the request
        return Request(url, callback=self.parse, dont_filter=True, meta=meta)

    def parse(self, response):

        item_index = response.meta['index']

        sel = Selector(response)
        item = Product()
        item['product_id'] = self.rows[item_index]['product_id']
        item['merchant_id'] = self.rows[item_index]['merchant_id']
        item['price'] = sel.xpath(self.rows[item_index]['xpath_rule']).extract()

        return item

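Thanks for the help! That led me to the solution :)

Sure. Also, you might just want to pass the whole row object itself instead of the index (since in your example you only use the index to look up the row object)?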
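A rough sketch of that last suggestion, again with the standard library only so it can run without Scrapy or a network. The in-memory SQLite table, the page bodies, the XPath rules and the dict-based requests are all assumptions made for the example; xml.etree.ElementTree stands in for Scrapy's Selector and only supports a limited XPath subset. The point is that parse() needs no index and no shared state once the row travels with the request.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical in-memory table standing in for products_merchant.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE products_merchant ("
    "product_id INTEGER, merchant_id INTEGER, product_url TEXT, xpath_rule TEXT)"
)
conn.executemany(
    "INSERT INTO products_merchant VALUES (?, ?, ?, ?)",
    [
        (1, 100, "http://example.com/a", ".//span[@class='price']"),
        (2, 200, "http://example.com/b", ".//div[@id='cost']"),
    ],
)

# Fake page bodies keyed by URL, so no network access is needed.
PAGES = {
    "http://example.com/a": "<html><body><span class='price'>9.99</span></body></html>",
    "http://example.com/b": "<html><body><div id='cost'>19.50</div></body></html>",
}


def start_requests():
    """Attach the whole row to each request, mirroring meta={'row': row}."""
    for row in conn.execute("SELECT * FROM products_merchant"):
        yield {"url": row["product_url"], "meta": {"row": dict(row)}}


def parse(response):
    """Use the row carried by the response; no index or shared state."""
    row = response["meta"]["row"]
    tree = ET.fromstring(response["body"])
    return {
        "product_id": row["product_id"],
        "merchant_id": row["merchant_id"],
        "price": [el.text for el in tree.findall(row["xpath_rule"])],
    }


# Answer the requests in reverse order to mimic asynchronous delivery.
requests = list(start_requests())
responses = [
    {"url": r["url"], "meta": r["meta"], "body": PAGES[r["url"]]}
    for r in reversed(requests)
]
items = [parse(resp) for resp in responses]
print(items)
```

In the spider itself this would correspond to building the request with meta={'row': row} and reading response.meta['row'] inside parse(), instead of indexing into self.rows.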