
Scrapy: how do I skip the parent directory when crawling a file-hosting site?


When crawling a site that is a basic folder hierarchy with files stored in directories,

yield scrapy.Request(url1, callback=self.parse)
follows the links and scrapes everything behind them, but I regularly run into the crawler following the parent-directory link; once a parent directory sits in the middle of the path, it fetches all the same files again under different URLs:

http://example.com/root/sub/file
http://example.com/root/sub/../sub/file
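
For illustration only, using the placeholder host and paths above: building links by plain string concatenation, as the snippet further down does, keeps the ".." segment, so the same file shows up under two distinct URLs:

# Illustration only: naive string concatenation keeps the '..' segment,
# so one file ends up with two distinct-looking URLs.
base = 'http://example.com/root/sub/'
print(base + 'file')         # http://example.com/root/sub/file
print(base + '../sub/file')  # http://example.com/root/sub/../sub/file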
Any help would be appreciated.

Here is a snippet of the code:

import scrapy
from scrapy import Spider


class fileSpider(Spider):
    name = 'filespider'

    def __init__(self, filename=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if filename:
            with open(filename, 'r') as f:
                self.start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        # Item and videoext (a list of video file extensions) are
        # defined elsewhere in the project.
        item = Item()
        for url in response.xpath('//a/@href').extract():
            url1 = response.url + url
            if url1[-4:] in videoext:
                item['name'] = url
                item['url'] = url1
                item['depth'] = response.meta["depth"]
                yield item
            elif url1[-1] == '/':
                yield scrapy.Request(url1, callback=self.parse)

You can use os.path.normpath to normalize every path, so you don't end up with duplicates:

import os
from urllib.parse import urlparse, urlunparse
...

    def parse(self, response):
        item = Item()
        for url in response.xpath('//a/@href').extract():
            url1 = response.url + url

            # =======================
            # Normalize the path component so that
            # /root/sub/../sub/file collapses to /root/sub/file.
            url_parts = list(urlparse(url1))
            url_parts[2] = os.path.normpath(url_parts[2])
            # normpath drops a trailing slash, so restore it to keep
            # the directory check below working.
            if url1.endswith('/') and not url_parts[2].endswith('/'):
                url_parts[2] += '/'
            url1 = urlunparse(url_parts)
            # =======================

            if url1[-4:] in videoext:
                item['name'] = url
                item['url'] = url1
                item['depth'] = response.meta["depth"]
                yield item
            elif url1[-1] == '/':
                yield scrapy.Request(url1, callback=self.parse)
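
To see the normalization in isolation, here is a small standard-library sketch using the placeholder URL from the question:

import os
from urllib.parse import urlparse, urlunparse

u = 'http://example.com/root/sub/../sub/file'
parts = list(urlparse(u))
# normpath collapses '/root/sub/../sub/file' to '/root/sub/file'
parts[2] = os.path.normpath(parts[2])
print(urlunparse(parts))  # http://example.com/root/sub/file

Note that urllib.parse.urljoin, which backs Scrapy's response.urljoin, performs the same dot-segment removal when resolving relative links, so building url1 with response.urljoin(url) instead of string concatenation would avoid most of these duplicates in the first place.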