Scrapy: how to set different settings for different spiders?
I want to enable HTTP proxies for some spiders and disable them for other spiders. Can I do something like this?
# settings.py
proxy_spiders = ['a1', 'b2']

if spider in proxy_spiders:  # how do I get the spider name here???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'myproject.middlewares.ProxyMiddleware': 410,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }
else:
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    }
If the code above can't work, is there any other suggestion?

Why not use two projects rather than only one? Let's name these two projects proj1 and proj2. In proj1's settings.py, put the following settings:
HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
In proj2's settings.py, put the following settings:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
You can define your own proxy middleware, something simple like this:
from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

class ConditionalProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        # Only route the request through the proxy when the spider opts in
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)
Then, in the spiders where you want the proxy enabled, define the attribute use_proxy = True. Don't forget to disable the default proxy middleware and enable your modified one; the wiring is sketched below.
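A minimal sketch of that wiring, assuming the middleware above is saved as myproject/middlewares.py (the paths here are illustrative) and the pre-1.0 scrapy.contrib layout used elsewhere on this page:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # disable the stock proxy middleware ...
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
    # ... and install the conditional one in the same slot (750 is the default priority)
    'myproject.middlewares.ConditionalProxyMiddleware': 750,
}

# myproject/spiders/proxied.py
from scrapy.spider import BaseSpider

class ProxiedSpider(BaseSpider):
    name = 'proxied'
    use_proxy = True  # checked by ConditionalProxyMiddleware.process_request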
You can add settings.overrides in the spider.py file. An example that works:
from scrapy.conf import settings
settings.overrides['DOWNLOAD_TIMEOUT'] = 300
In your case, something like this should also work:
from scrapy.conf import settings

settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
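One caveat: scrapy.conf.settings is a process-wide singleton, so overrides set this way apply to every spider run in that process, not only the spider defined in that file. A sketch of the usual placement, assuming a pre-1.0 Scrapy where scrapy.conf still exists (it was removed in later releases in favour of custom_settings); the spider name and URL are illustrative:

# spider.py (pre-1.0 Scrapy)
from scrapy.conf import settings
from scrapy.spider import BaseSpider

# applied when this module is imported, before the crawl starts
settings.overrides['DOWNLOAD_TIMEOUT'] = 300

class SlowSiteSpider(BaseSpider):
    name = 'slow_site'
    start_urls = ['http://example.com']

    def parse(self, response):
        pass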
A bit late, but since version 1.0.0 there is a new feature in Scrapy that lets you override settings per spider, like this:
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": 'http://127.0.0.1:8123',
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        },
    }

class MySpider2(scrapy.Spider):
    name = "my_spider2"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        },
    }
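A quick way to check that a per-spider override actually took effect; this is a sketch for Scrapy >= 1.0, where self.settings exposes the project settings merged with custom_settings (the spider name and URL are illustrative):

import scrapy

class SettingsCheckSpider(scrapy.Spider):
    name = 'settings_check'
    start_urls = ['http://example.com']
    # custom_settings must be a class attribute: it is read before the spider is instantiated
    custom_settings = {'DOWNLOAD_TIMEOUT': 300}

    def parse(self, response):
        # logs 300 instead of the project-wide default
        self.logger.info('DOWNLOAD_TIMEOUT = %s', self.settings.getint('DOWNLOAD_TIMEOUT'))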
There is a new and easier way to do this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'SOME_SETTING': 'some value',
    }
I use Scrapy 1.3.1.

This is not what the asker wants to do; in some cases you need multiple spiders within the same project.

Don't forget from scrapy.conf import settings before using settings.overrides. In Scrapy versions greater than 1, overrides is deprecated; using the custom_settings dictionary in the spider declaration works instead.

I get NameError: name 'AUTOTHROTTLE_ENABLED' is not defined. Your snippet doesn't show where you import settings.py.
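As an aside on the NameError mentioned above: the keys of custom_settings are plain strings, so the setting name must be quoted; a bare AUTOTHROTTLE_ENABLED is looked up as a Python variable and fails. A minimal sketch:

import scrapy

class ThrottledSpider(scrapy.Spider):
    name = 'throttled'
    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,  # quoted string key, not a bare identifier
    }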