Python: trying to fake and rotate user agents


I am trying to fake user agents in Python and rotate them. I found a tutorial online on how to do this with Scrapy using the scrapy-useragents package. I scrape a web page that reports my user agent, to check whether it differs from my real one and whether it rotates. It does differ from my actual user agent, but it does not rotate: it returns the same user agent every time, and I cannot figure out where it goes wrong.

settings.py

BOT_NAME = 'scrapy_javascript'

SPIDER_MODULES = ['scrapy_javascript.spiders']
NEWSPIDER_MODULE = 'scrapy_javascript.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_javascript (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# -----------------------------------------------------------------------------
# USER AGENT
# -----------------------------------------------------------------------------

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}


USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]

SPLASH_URL = 'http://199.89.192.74:8050'


DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
BOT_NAME = 'scrapy_javascript'

SPIDER_MODULES = ['scrapy_javascript.spiders']
NEWSPIDER_MODULE = 'scrapy_javascript.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# The path of the csv file that contains the pairs
PROXY_CSV_FILE = "proxies.csv"

DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

#SPLASH_URL = 'http://127.0.0.1:8050'

#SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'



# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 16

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 60
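
For reference, a minimal way to check what actually gets sent is a throwaway spider that requests a header-echo page a few times and logs the User-Agent of each request. This is only a sketch: the question does not say which page was used for the check, httpbin.org is just one convenient echo service, and response.json() needs Scrapy 2.2+.

import scrapy

class UACheckSpider(scrapy.Spider):
    """Throwaway spider: log the User-Agent each request was sent with."""
    name = 'ua_check'

    def start_requests(self):
        # dont_filter=True so the dupefilter does not collapse the
        # repeated requests to the same URL
        for _ in range(5):
            yield scrapy.Request('https://httpbin.org/headers',
                                 callback=self.parse, dont_filter=True)

    def parse(self, response):
        # httpbin echoes the request headers back as JSON
        sent_ua = response.json()['headers'].get('User-Agent')
        self.logger.info('Sent User-Agent: %s', sent_ua)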

Solved it by creating a csv file that contains all of the URLs, each paired with an IP and a user agent, so every time I visit a web page I use that IP and user agent. Then I have to override the splash_url in my spider so that my splash_url equals the proxy I am using at that moment.
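
For illustration, proxies.csv would then look something like this: a header row, then one row per URL with the bare IP and the user-agent string. The values below are just the ones from the comments in the spider code further down, not working proxies; process_csv() skips the header row and adds the http:// scheme and :8050 port to the ip column itself.

url,ip,useragent
http://www.starcitygames.com/catalog/category/Rivals%20of%20Ixalan,204.152.114.244,Mozilla/5.0 (BlackBerry; U; BlackBerry 9320; en-GB) AppleWebKit/534.11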

SplashSpider.py

import csv
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import GameItem

# process the csv file so the url + ip address + useragent pairs are the same as defined in the file
# returns a list of dictionaries, example:
# [ {'url': 'http://www.starcitygames.com/catalog/category/Rivals%20of%20Ixalan',
#    'ip': 'http://204.152.114.244:8050',
#    'ua': "Mozilla/5.0 (BlackBerry; U; BlackBerry 9320; en-GB) AppleWebKit/534.11"},
#    ...
# ]
def process_csv(csv_file):
    data = []
    reader = csv.reader(csv_file)
    next(reader)
    for fields in reader:
        if fields[0] != "":
            url = fields[0]
        else:
            continue # skip the whole row if the url column is empty
        if fields[1] != "":
            ip = "http://" + fields[1] + ":8050" # adding http and port because this is the needed scheme
        if fields[2] != "":
            useragent = fields[2]
        data.append({"url": url, "ip": ip, "ua": useragent})
    return data


class MySpider(Spider):
    name = 'splash_spider'  # Name of Spider

    # notice that we don't need to define start_urls
    # just make sure to get all the urls you want to scrape inside start_requests function

    # getting all the url + ip address + useragent pairs then request them
    def start_requests(self):

        # get the file path of the csv file that contains the pairs from the settings.py
        with open(self.settings["PROXY_CSV_FILE"], mode="r") as csv_file:
            # requests is a list of dictionaries like this -> {url: str, ua: str, ip: str}
            requests = process_csv(csv_file)

        for req in requests:
            # no need to create custom middlewares
            # just pass useragent using the headers param, and pass proxy using the meta param

            yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                    headers={"User-Agent": req["ua"]},
                    splash_url = req["ip"],
                    )

Here you can find an API that returns the most common user agents as JSON:
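
Purely as a sketch, this is how such an endpoint could feed the USER_AGENTS setting; the URL and the response shape (a plain JSON array of user-agent strings) are assumptions, not the actual API:

import requests

# hypothetical endpoint that returns a JSON array of user-agent strings
UA_API_URL = 'https://example.com/most-common-user-agents.json'

def fetch_user_agents(url=UA_API_URL):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return list(response.json())

# e.g. in settings.py: USER_AGENTS = fetch_user_agents()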

I used this tool, which keeps your list of user agents updated with the most recent and most commonly used ones:

from shadow_useragent import ShadowUserAgent

shadow_useragent = ShadowUserAgent()

print(shadow_useragent.firefox)
# Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0
print(shadow_useragent.chrome)
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
print(shadow_useragent.safari)
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
print(shadow_useragent.edge)
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134
print(shadow_useragent.ie)
# Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
print(shadow_useragent.android)
# Mozilla/5.0 (Linux; U; Android 4.3; en-us; SM-N900T Build/JSS15J) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
print(shadow_useragent.ipad)
# Mozilla/5.0 (iPad; CPU OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Mobile/15E148 Safari/604.1
print(shadow_useragent.random)
# Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0

# the best one: random, weighted by real-world browser usage statistics
print(shadow_useragent.random)
# Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36

# if you want to exclude mobiles (some websites serve a different page to them)
print(shadow_useragent.random_nomobile)
# Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36

See if this helps: the scrapy-useragents package readme uses a priority of 400, whereas the code here uses 500.

Does this only happen when you are using Splash?
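
On the priority point: a version of the settings where everything sits in one DOWNLOADER_MIDDLEWARES dict and the scrapy-useragents middleware gets the readme's 400 would look roughly like the snippet below. Note that in the settings above, the second DOWNLOADER_MIDDLEWARES assignment replaces the first, so the Splash entries are lost; this is only a sketch of the suggestion, not a confirmed fix.

# one combined dict, so a later assignment does not overwrite the Splash entries
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 400,  # readme priority
}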