For loop: my loop in Scrapy does not run in order


I am scraping a series of URLs. The code runs, but Scrapy is not parsing the URLs in sequential order. For example, although I am trying to parse url1, url2, ..., url100, Scrapy parses url2, url10, url1, and so on.

It parses all of the URLs, but when a particular URL does not exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure I have no duplicates, I need the loop to parse the URLs sequentially rather than "at will".


I tried "for n in range(1, 101)" as well as a "while bID < …" style loop, with the same result.

You can use the priority attribute on the Request object. Scrapy guarantees that URLs are crawled in DFO (depth-first order) by default, but it does not ensure that the URLs are visited in that order within your parse callback.
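To illustrate the priority idea, here is a minimal sketch (not from the original answer) that assumes the requests are generated in a start_requests method. Requests with a higher priority value are scheduled earlier, so negating n makes lower b_id values go out first; note that concurrent downloads can still finish out of order.

from scrapy.http import Request

def start_requests(self):
    for n in range(1, 101):
        yield Request(
            url='https://www.example.com/units.aspx?b_id=%s' % n,
            # higher priority values are processed earlier, so negate n
            priority=-n,
            dont_filter=True,
            callback=self.parse_add_tables,
            meta={'bID': n, 'metaItems': []},
        )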

Instead of yielding Request objects, you want to return an array of Requests from which objects get popped until it is empty.

See here for more information.


You could try something like this. I'm not sure whether it's fit for purpose, since I haven't seen the rest of the spider code, but here you go:

# create a list of urls to be parsed, in reverse order (so we can easily pop items off)
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n for n in xrange(99,1,-1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        return Request(url='https://www.example.com/units.aspx?b_id=1',dont_filter=True,callback=self.parse_add_tables,meta={'bID':1,'metaItems':[]})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    # parsing code here
    if self.crawl_urls:
        next_url = self.crawl_urls.pop()
        # extract the full b_id from the query string (next_url[-1:] would
        # only keep the last digit for two- and three-digit ids)
        next_bID = int(next_url.split('b_id=')[-1])
        return Request(url=next_url,dont_filter=True,callback=self.parse_add_tables,meta={'bID':next_bID,'metaItems':[]})

    return items

Thank you for your answer! I searched but didn't find this post. I'm new to Python and Scrapy, so I need to learn more about how to change the default attributes.
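As a pointer on changing those defaults (not covered in the answer above): the Scrapy FAQ describes switching the scheduler from its default LIFO (depth-first) queues to FIFO (breadth-first) queues via the project settings. The setting values below are taken from recent Scrapy versions and may not match the older release this thread was written against:

# settings.py -- a sketch for a breadth-first (FIFO) crawl order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'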