Scrapy: how do I write a UserAgentMiddleware?
I want to write a UserAgentMiddleware for Scrapy. The documentation says:

Middleware that allows spiders to override the default user agent. In order for a spider to override the default user agent, its user_agent attribute must be set.

But there is no example, and I don't know how to write one. Any suggestions?

You can look at the built-in one in your install path: /Users/tarun.lalwani/.virtualenvs/project/lib/python3.6/site-packages/scrapy/downloadermiddleware/useragent.py. Its docstring reads: """Set User-Agent header per spider or use a default value from settings""". Below you can see an example of setting a random user agent.
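Before the random-UA example, it may help to see what the built-in middleware actually does. The following is a simplified plain-Python sketch of its logic (no Scrapy imports, so it runs standalone); the class and method names mirror the real module, but treat this as an illustration rather than the actual source:

```python
# Simplified sketch of scrapy's built-in UserAgentMiddleware logic
# (plain Python, for illustration; the real one is wired up via from_crawler).

class UserAgentMiddleware:
    """Set the User-Agent header per spider, or fall back to a default."""

    def __init__(self, user_agent="Scrapy"):
        # Default comes from the USER_AGENT setting in the real middleware.
        self.user_agent = user_agent

    def spider_opened(self, spider):
        # This is the override the docs mention: a spider's user_agent
        # attribute, if set, replaces the default.
        self.user_agent = getattr(spider, "user_agent", None) or self.user_agent

    def process_request(self, request, spider):
        if self.user_agent:
            # setdefault: only set the header if the request has none yet.
            request.headers.setdefault(b"User-Agent", self.user_agent)
```

So "must set its user_agent attribute" simply means defining `user_agent = "MyBot/1.0"` on your spider class; the middleware picks it up when the spider opens.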
First, visit some sites and collect a handful of up-to-date user agents. Then do something like the following in a standard middleware. This is where you set your own user-agent handling: grab a random UA from the text file and put it into the headers. This is a rough example; you would want to do the `random` import once at the top, and make sure you close useragents.txt when you are done with it. I would probably just load them into a list at the top of the module.
import random

from scrapy import signals


class GdataDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Pick a random user agent from the file and set it on the request.
        with open('useragents.txt', 'r') as f:
            user_agents = f.readlines()
        user_agent = random.choice(user_agents).strip()
        request.headers.setdefault(b'User-Agent', user_agent)
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
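For the middleware to run at all, it has to be registered in your project's settings.py. A sketch, assuming the class lives in myproject/middlewares.py (the module path is an assumption; adjust it to your project layout):

```python
# settings.py -- register the custom middleware.
# 'myproject.middlewares' is a hypothetical module path; change it to match
# where you actually put GdataDownloaderMiddleware.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.GdataDownloaderMiddleware': 400,
    # Disable the stock UserAgentMiddleware so the two don't compete
    # over the same header.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```

Since both middlewares use `headers.setdefault`, whichever runs first wins; disabling the built-in one avoids any ordering surprises.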