For loop — my loop in Scrapy is not running in order
I am scraping a series of URLs. The code runs, but Scrapy does not parse the URLs in order: although I try to parse url1, url2, …, url100, Scrapy parses url2, url10, url1, and so on. It does parse all of the URLs, but when a particular URL does not exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure there are no duplicate URLs, I need the loop to parse the URLs in order rather than "at will". I have tried "for n in range(1,101)" and "while bID…".
You can use the priority attribute on the Request object. Scrapy guarantees that URLs are crawled in DFO by default, but it does not ensure that URLs are visited in your parse callback in the order they were yielded. Rather than yielding Request objects one by one, you want to keep an array of Requests and pop objects off it until it is empty. For more information, see here.
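To see why priority helps: Scrapy's scheduler behaves like a priority queue, where a request with a higher priority value is dequeued before lower ones. A minimal standalone sketch of that behavior (using Python's heapq as a stand-in for the scheduler, not Scrapy itself; the URLs and priority scheme are illustrative assumptions):

```python
import heapq

# Model the scheduler as a priority queue. heapq is a min-heap, so we
# push the negated priority to make higher-priority requests pop first,
# which is how Scrapy's scheduler treats the Request.priority value.
queue = []
for n in range(1, 6):
    url = 'https://www.example.com/units.aspx?b_id=%d' % n
    priority = 100 - n  # earlier b_id => higher priority
    heapq.heappush(queue, (-priority, n, url))

# Requests come back out in ascending b_id order, regardless of how
# they were interleaved when pushed.
order = [heapq.heappop(queue)[2] for _ in range(5)]
print(order[0])  # → https://www.example.com/units.aspx?b_id=1
```

In a real spider this corresponds to yielding `Request(url, priority=100 - n)` for each n, so b_id=1 tends to be fetched first; note the scheduler still only controls dequeue order, not the order responses arrive.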
You could try something like this. I'm not sure whether it's fit for purpose, because I haven't seen the rest of the spider code, but here you go:
from scrapy import Request

# Inside the spider class: a list of urls to be parsed, in reverse order
# (so we can cheaply pop the next one off the end). Covers b_id 2..100;
# b_id=1 is requested explicitly below.
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n
              for n in range(100, 1, -1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.text:
        self.log("Successfully logged in. Let's start crawling!")
        print("Successfully logged in. Let's start crawling!")
        # Now the crawling can begin..
        self.initialized()
        return Request(url='https://www.example.com/units.aspx?b_id=1',
                       dont_filter=True,
                       callback=self.parse_add_tables,
                       meta={'bID': 1, 'metaItems': []})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    # parsing code here (builds the items list)
    if self.crawl_urls:
        next_url = self.crawl_urls.pop()
        # Read b_id from the query string; next_url[-1:] would only
        # work for single-digit ids.
        return Request(url=next_url,
                       dont_filter=True,
                       callback=self.parse_add_tables,
                       meta={'bID': int(next_url.split('=')[-1]),
                             'metaItems': []})
    return items
Thanks for your answer! I searched the index but didn't find this post. I'm new to Python and Scrapy, so I need to learn more about how to change the default properties.
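On changing the defaults: the queue behavior mentioned in the answer is controlled from the project's settings.py. A hedged sketch (setting names and queue class paths as documented for modern Scrapy releases; older versions used different module paths):

```python
# settings.py fragment: crawl breadth-first instead of depth-first,
# and use FIFO queues so requests tend to be processed in the order
# they were scheduled.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```

Even with these settings, Scrapy issues requests concurrently, so responses can still arrive out of order; the pop-one-request-at-a-time approach in the answer is the way to force strictly sequential fetching.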