Python: how do I create a custom Scrapy URL filter to avoid duplicates?
I am building a Scrapy crawler, but the default filter class RFPDupeFilter is not working properly in my application: the crawler gives me a lot of duplicate content. So I tried the following example, but it does not work for me. It gives me an ImportError: No module named scraper.custom_filters, even though I saved the class as custom_filters.py in the same directory as settings.py.
from scrapy.dupefilter import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    """A dupe filter that considers the URL"""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        else:
            self.urls_seen.add(request.url)
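Outside of Scrapy, the dedup logic above is just a set lookup. A minimal standalone sketch of the same idea (the Request stub here is hypothetical, only mimicking the .url attribute of a Scrapy request; the explicit return False is added for clarity, since the original relies on an implicit None):

```python
class Request:
    """Hypothetical stub standing in for scrapy.Request; only carries a URL."""
    def __init__(self, url):
        self.url = url

class SeenURLFilter:
    """Marks a request as seen if its exact URL was seen before."""
    def __init__(self):
        self.urls_seen = set()

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        self.urls_seen.add(request.url)
        return False

f = SeenURLFilter()
print(f.request_seen(Request('http://example.com/a')))  # first visit -> False
print(f.request_seen(Request('http://example.com/a')))  # duplicate -> True
```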
Then I added the DUPEFILTER_CLASS constant to settings.py:
DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
The path specified in DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter' is wrong, which is what causes the ImportError. Most likely you are missing a package in the dotted path, or have included one you should not have.
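To see why the dotted string matters, note that Scrapy resolves DUPEFILTER_CLASS by importing the module part and then looking up the class name. A rough sketch of that resolution (load_object here is a simplified stand-in, not Scrapy's actual implementation):

```python
import importlib

def load_object(path):
    # Split 'scraper.custom_filters.SeenURLFilter' into a module path
    # and a class name, roughly what Scrapy does with DUPEFILTER_CLASS.
    module_path, _, name = path.rpartition('.')
    module = importlib.import_module(module_path)  # raises ImportError if the layout is wrong
    return getattr(module, name)

try:
    load_object('scraper.custom_filters.SeenURLFilter')
except ImportError as e:
    print(e)  # e.g. a "No module named ..." message, as in the question
```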
For your project, find the scrapy.cfg file and trace the directory structure from that point to work out the namespace to use in the string. To be correct, your directory structure needs to look something like:
myproject
|---<scraper>
| |---<spiders>
| | |---__init__.py
| | |---myspider.py
| |---__init__.py
| |---<...>
| |---custom_filters.py
| |---settings.py
|---scrapy.cfg
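A common cause of this ImportError is a missing __init__.py somewhere in that tree. A small sketch that checks for this, run from the directory containing scrapy.cfg (the package names are assumptions matching the layout above):

```python
import pathlib
import tempfile

def missing_inits(project_root):
    """Return package dirs under project_root that lack an __init__.py.

    Without __init__.py, 'scraper.custom_filters' is not importable
    as a package module, which yields an ImportError like the one above.
    """
    root = pathlib.Path(project_root)
    missing = []
    for pkg in ('scraper', 'scraper/spiders'):
        if not (root / pkg / '__init__.py').exists():
            missing.append(pkg)
    return missing

# Demo against a throwaway tree mirroring the layout above:
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / 'scraper' / 'spiders').mkdir(parents=True)
(tmp / 'scraper' / '__init__.py').touch()
print(missing_inits(tmp))  # 'scraper/spiders' still lacks an __init__.py
```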
You need to share the code you tried and show the full error message. — OK, I just updated the question with the source code from the example. — Could you also add your directory structure to the question?