Scrapy: How to write a UserAgentMiddleware?


I want to write a UserAgentMiddleware for Scrapy.

The docs say:

Middleware that allows spiders to override the default user agent. In order for a spider to override the default user agent, its user_agent attribute must be set.

Docs:

But there is no example, and I don't know how to write one.

Any suggestions?

You can look at it in your installation path:

/Users/tarun.lalwani/.virtualenvs/project/lib/python3.6/site-packages/scrapy/downloadermiddleware/useragent.py

"""Set User-Agent header per spider or use a default value from settings"""
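For reference, the whole file is short. Reconstructed from the Scrapy 1.x source (check the copy at the path above for the exact code in your version), it looks like this:

from scrapy import signals


class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # Start from the project-wide USER_AGENT setting...
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        # ...and let the spider's user_agent attribute override it, which
        # is why the docs say a spider only needs to set that attribute.
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)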

Below you can see an example of setting a random user agent.


First visit some site and grab some of the latest user agents, then store them line by line in useragents.txt. In a standard downloader middleware you can then do something like the following (this is also where you would set your own proxy settings): grab a random UA from the file and put it into the request headers. Import random at the top, make sure useragents.txt gets closed when you are done with it, and rather than re-reading the file on every request, just load the agents into a list once up front:

import random

from scrapy import signals


class GdataDownloaderMiddleware(object):

    def __init__(self):
        # Load the user agents once, one per line; the with-block closes
        # the file, and strip() removes the trailing newlines that
        # readlines() would otherwise leave in each UA.
        with open('useragents.txt') as f:
            self.user_agents = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware: pick a random UA and put it in the headers.
        request.headers.setdefault(b'User-Agent', random.choice(self.user_agents))

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
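To actually use this, register the middleware in settings.py. A minimal sketch, assuming the class lives in a hypothetical myproject/middlewares.py: the built-in UserAgentMiddleware runs at priority 400 and also calls headers.setdefault(), so disable it (or run yours no later than 400), otherwise the header will already be set before your process_request gets a chance:

# settings.py -- "myproject" is an assumed project name for this example
DOWNLOADER_MIDDLEWARES = {
    # Disable the stock middleware so it does not set User-Agent first
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.GdataDownloaderMiddleware': 400,
}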

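To check that the rotation is actually applied, you can log the header that was sent from any spider callback (a hypothetical snippet, not from the original answers):

def parse(self, response):
    # The headers the downloader actually sent live on response.request
    self.logger.info(response.request.headers.get('User-Agent'))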