
Python Scrapy: save every unique page reached from a list of start URLs


I want to give Scrapy a list of start URLs and have it visit every link on each start page. For each link, if it has not visited that page before, I want it to download the page and save it locally. How can I do this?

Set the default parse callback to extract every link on the page and schedule a request for each one. Scrapy's built-in duplicate filter (enabled by default) skips requests for URLs it has already seen, so each unique page is downloaded only once:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Extract every link on the start page and follow it.
    links = LinkExtractor().extract_links(response)
    return (Request(url=link.url, callback=self.parse_page) for link in links)

def parse_page(self, response):
    # name = manipulate response.url to be a unique file name
    with open(name, 'wb') as f:
        f.write(response.body)
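One way to fill in the "unique file name" step above is to hash the URL. The helper below is a minimal sketch (the name `url_to_filename` is not from the original answer): hashing the full URL gives a filesystem-safe name, and prefixing the host keeps the saved files human-browsable.

```python
import hashlib
from urllib.parse import urlparse

def url_to_filename(url):
    # Hash the full URL so every distinct page maps to a unique, safe file name.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Prefix with the host name so saved files are easier to browse.
    host = urlparse(url).netloc or "page"
    return f"{host}-{digest}.html"
```

In `parse_page` you would then write `name = url_to_filename(response.url)` before opening the file.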