For loop — my loop in Scrapy is not running in order
I am scraping a series of URLs. The code runs, but Scrapy does not parse the URLs in order: although I try to parse url1, url2, …, url100, Scrapy parses url2, url10, url1, and so on. It does parse all of the URLs, but when a particular URL does not exist (e.g. example.com/unit.aspx?b_id=10), Firefox shows me the result of my previous request. Since I want to make sure there are no duplicate URLs, I need the loop to parse the URLs in order rather than "at will". I have tried "for n in range(1,101)" and "while bID…".
You can use the priority attribute on the Request object. Scrapy guarantees that URLs are crawled in DFO by default, but it does not ensure that URLs are visited in your parse callback in the order they were yielded. Rather than yielding Request objects one by one, you want to keep an array of Requests and pop objects off it until it is empty. For more information, see here.
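To see why priority helps: Scrapy's scheduler behaves like a priority queue, where a request with a higher priority value is dequeued before lower ones. A minimal standalone sketch of that behavior (using Python's heapq as a stand-in for the scheduler, not Scrapy itself; the URLs and priority scheme are illustrative assumptions):

```python
import heapq

# Model the scheduler as a priority queue. heapq is a min-heap, so we
# push the negated priority to make higher-priority requests pop first,
# which is how Scrapy's scheduler treats the Request.priority value.
queue = []
for n in range(1, 6):
    url = 'https://www.example.com/units.aspx?b_id=%d' % n
    priority = 100 - n  # earlier b_id => higher priority
    heapq.heappush(queue, (-priority, n, url))

# Requests come back out in ascending b_id order, regardless of how
# they were interleaved when pushed.
order = [heapq.heappop(queue)[2] for _ in range(5)]
print(order[0])  # → https://www.example.com/units.aspx?b_id=1
```

In a real spider this corresponds to yielding `Request(url, priority=100 - n)` for each n, so b_id=1 tends to be fetched first; note the scheduler still only controls dequeue order, not the order responses arrive.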
You could try something like this. I'm not sure whether it's fit for purpose, because I haven't seen the rest of the spider code, but here you go:
from scrapy import Request

# Inside the spider class: a list of urls to be parsed, in reverse order
# (so we can cheaply pop the next one off the end). Covers b_id 2..100;
# b_id=1 is requested explicitly below.
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n
              for n in range(100, 1, -1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.text:
        self.log("Successfully logged in. Let's start crawling!")
        print("Successfully logged in. Let's start crawling!")
        # Now the crawling can begin..
        self.initialized()
        return Request(url='https://www.example.com/units.aspx?b_id=1',
                       dont_filter=True,
                       callback=self.parse_add_tables,
                       meta={'bID': 1, 'metaItems': []})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    # parsing code here (builds the items list)
    if self.crawl_urls:
        next_url = self.crawl_urls.pop()
        # Read b_id from the query string; next_url[-1:] would only
        # work for single-digit ids.
        return Request(url=next_url,
                       dont_filter=True,
                       callback=self.parse_add_tables,
                       meta={'bID': int(next_url.split('=')[-1]),
                             'metaItems': []})
    return items
Thanks for your answer! I searched the index but didn't find this post. I'm new to Python and Scrapy, so I need to learn more about how to change the default properties.
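On changing the defaults: the queue behavior mentioned in the answer is controlled from the project's settings.py. A hedged sketch (setting names and queue class paths as documented for modern Scrapy releases; older versions used different module paths):

```python
# settings.py fragment: crawl breadth-first instead of depth-first,
# and use FIFO queues so requests tend to be processed in the order
# they were scheduled.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```

Even with these settings, Scrapy issues requests concurrently, so responses can still arrive out of order; the pop-one-request-at-a-time approach in the answer is the way to force strictly sequential fetching.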