Web scraping: updating a list and set during iteration -- while loop

Tags: web-scraping, scrapy

I have a rough script that works as follows: 1) collect the navigation links into a list and call a new parse callback for each.
g_next_page_list = []
g_next_page_set = set()

def parse(self, response):
    # code to extract nav_links
    for nav_link in nav_links:
        if nav_link not in g_next_page_set:
            g_next_page_list.append(nav_link)
            g_next_page_set.add(nav_link)
    for next_page in g_next_page_list:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse_start_url, dont_filter=True)
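For context, `response.urljoin` resolves a relative nav link against the current page URL; Scrapy delegates to the standard library's `urllib.parse.urljoin` with the response URL as the base. A minimal sketch (the base URL here is a hypothetical example):

```python
from urllib.parse import urljoin

# Scrapy's Response.urljoin(url) is equivalent to
# urljoin(response.url, url): relative nav links become absolute.
base = "https://example.com/catalog/page1"  # hypothetical page URL
print(urljoin(base, "page2"))   # https://example.com/catalog/page2
print(urljoin(base, "/about"))  # https://example.com/about
```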
I defined parse_start_url as:
def parse_start_url(self, response):
    # code to extract nav_links
    for nav_link in nav_links:
        if nav_link not in g_next_page_set:
            g_next_page_list.append(nav_link)
            g_next_page_set.add(nav_link)
However, the global list and set from the main parse (g_next_page_set, g_next_page_list) are not being appended to. What am I doing wrong?
Thanks in advance.

Don't use global variables here; use self.variable_name instead:
g_next_page_list = []
g_next_page_set = set()

def parse(self, response):
    # code to extract nav_links
    for nav_link in nav_links:
        if nav_link not in self.g_next_page_set:
            self.g_next_page_list.append(nav_link)
            self.g_next_page_set.add(nav_link)
    for next_page in self.g_next_page_list:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse_start_url, dont_filter=True)

def parse_start_url(self, response):
    # code to extract nav_links
    for nav_link in nav_links:
        if nav_link not in self.g_next_page_set:
            self.g_next_page_list.append(nav_link)
            self.g_next_page_set.add(nav_link)
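The shared-state idea can be sketched outside Scrapy with a plain class; `FakeSpider`, `collect`, and the sample links below are hypothetical stand-ins for the spider and its extracted nav_links:

```python
class FakeSpider:
    """Mimics the spider's dedup pattern with attributes accessed via self."""

    def __init__(self):
        self.g_next_page_list = []    # preserves discovery order
        self.g_next_page_set = set()  # O(1) membership checks

    def collect(self, nav_links):
        # Same body as parse/parse_start_url: append only unseen links.
        for nav_link in nav_links:
            if nav_link not in self.g_next_page_set:
                self.g_next_page_list.append(nav_link)
                self.g_next_page_set.add(nav_link)

spider = FakeSpider()
spider.collect(["/page2", "/page3"])
spider.collect(["/page3", "/page4"])  # /page3 is a duplicate
print(spider.g_next_page_list)  # ['/page2', '/page3', '/page4']
```

Because both callbacks read and write the same instance attributes, links discovered in parse_start_url are visible to parse on later calls.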
This should get it working.

Are v_next_page_list and g_next_page_list the same thing, or is one of them a typo? If they are different, please provide some sample data for v_next_page_list.

@supersam654 Sorry for the confusion -- they are the same; I have updated my original post.