Python: How to create a custom Scrapy URL filter to avoid duplicates?

Tags: python, scrapy, web-crawler

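I am building a Scrapy crawler, but the default filter class RFPDupeFilter does not work properly in my application. The crawler gives me a lot of duplicate content.

So I tried the following example.

But it does not work for me. It gives me an error, ImportError: No module named scraper.custom_filters, even though I saved the class in custom_filters.py, in the same directory as settings.py.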

from scrapy.dupefilter import RFPDupeFilter  # newer Scrapy releases name this module scrapy.dupefilters

class SeenURLFilter(RFPDupeFilter):
    """A dupe filter that considers the URL"""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        # Report a duplicate when this exact URL has been seen before.
        if request.url in self.urls_seen:
            return True
        self.urls_seen.add(request.url)
        return False

Add the DUPEFILTER_CLASS constant to settings.py:

DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'

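The path specified in DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter' is wrong, which causes the import error. Most likely the path is missing a package, or it includes a package that should not be there.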
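To see why a wrong dotted path surfaces as an ImportError, here is a minimal standard-library sketch of how a string such as 'package.module.Name' can be resolved to an object. The helper name load_dotted_path is hypothetical and this is not Scrapy's own loader; it only illustrates the general mechanism, assuming the string is split at its last dot.

```python
import importlib


def load_dotted_path(path):
    """Import 'package.module.Name' and return the object called Name."""
    # Everything before the last dot is the module; the rest is the attribute.
    module_path, _, name = path.rpartition(".")
    module = importlib.import_module(module_path)  # raises ImportError if not importable
    return getattr(module, name)


# A path that is importable resolves to the object itself.
print(load_dotted_path("collections.OrderedDict"))

# A path whose package cannot be found fails in the same way as in the question.
try:
    load_dotted_path("scraper.custom_filters.SeenURLFilter")
except ImportError as exc:
    print("ImportError:", exc)
```

If the second call fails on your machine, the package name at the start of the string does not match anything on the import path, which is the situation the answer describes.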

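For your project, find the "scrapy.cfg" file and follow the directory structure from that point to work out the namespace to use in the string. For the path to resolve, your directory structure needs to look similar to this: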

myproject
   |---<scraper>
   |   |---<spiders>
   |   |   |---__init__.py
   |   |   |---myspider.py
   |   |---__init__.py
   |   |---<...>
   |   |---custom_filters.py
   |   |---settings.py
   |---scrapy.cfg
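
The layout above can be checked without running the crawler. The sketch below recreates it in a temporary directory and confirms that 'scraper.custom_filters' is importable once the directory holding scrapy.cfg is on the import path. The file contents are placeholders, and the assumption that this directory is on the import path at crawl time is mine, not something stated in the thread.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate the layout in a temporary directory (file contents are placeholders).
root = Path(tempfile.mkdtemp()) / "myproject"
package = root / "scraper"
(package / "spiders").mkdir(parents=True)
(root / "scrapy.cfg").write_text("[settings]\ndefault = scraper.settings\n")
(package / "__init__.py").write_text("")
(package / "spiders" / "__init__.py").write_text("")
(package / "settings.py").write_text(
    "DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'\n"
)
(package / "custom_filters.py").write_text("class SeenURLFilter:\n    pass\n")

# Assumption: the directory that holds scrapy.cfg is on the import path.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# The dotted name mirrors the folders below scrapy.cfg: scraper/custom_filters.py
module = importlib.import_module("scraper.custom_filters")
filter_class = module.SeenURLFilter
print(filter_class.__module__)
```

Removing scraper/__init__.py, or naming the package folder something other than scraper, is a quick way to see how the string in DUPEFILTER_CLASS stops matching the layout.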

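You need to share the code you tried and show the full error message.
OK, I just updated the source code in the example.
Could you also add the directory structure to the question?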