Python 504 Gateway Timeout - using scrapy-proxy-pool and scrapy-user-agents


I am unable to scrape the data; the site responds with a 504 Gateway Timeout error. I tried two bypass approaches at the same time, a random User-Agent and a rotating proxy, but neither helped me scrape the data.

For the proxy approach I used scrapy-proxy-pool, and for the user-agent approach I used scrapy-user-agents, but neither of them works.

I am still getting 504 Gateway Timeout.

My spider:

import scrapy
import time 
import random
class LaughfactorySpider(scrapy.Spider):
    handle_httpstatus_list = [403, 504]
    name = "myspider"
    start_urls = ["mywebsitewebsite"]

    def parse(self,response):
        time.sleep(random.randint(0,4))
        for site in response.xpath("//section[@class='test']/div/ul"):
            item = {
                'name': site.xpath("//li[@class='centr']//h2/span/text()").extract_first()
            }
            yield item
settings.py:

###### For Dynamic Proxy

ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
   'formsubmit_getresult.pipelines.FormsubmitGetresultPipeline': 300,
}
# To Enable Proxy
PROXY_POOL_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}

###### For Dynamic UserAgent Middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
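One thing worth noting about the spider above: `time.sleep()` inside `parse()` blocks Scrapy's entire event loop, stalling every in-flight request, not just the current one. A sketch of the idiomatic alternative using Scrapy's built-in throttling settings (the delay values here are illustrative, not tuned for any particular site):

```python
# settings.py -- built-in throttling instead of time.sleep() in the spider.
# time.sleep() blocks the whole Twisted reactor; these settings instead
# delay only the scheduling of new requests.
DOWNLOAD_DELAY = 2               # base delay (seconds) between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # waits 0.5x-1.5x of DOWNLOAD_DELAY

# Or let Scrapy adapt the delay to observed server latency automatically:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
```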

You are not setting the User-Agent header correctly, and that is why the website is giving you a 504. You need to add a User-Agent header to the first request and to all subsequent requests.

Try something like this:

import random
import time

import scrapy
from scrapy import Request


class LaughfactorySpider(scrapy.Spider):
    handle_httpstatus_list = [403, 504]
    name = "myspider"
    start_urls = ["mywebsitewebsite"]

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    }

    def start_requests(self):
        # Send the User-Agent header with the very first request
        yield Request(self.start_urls[0], headers=self.headers)

    def parse(self, response):
        time.sleep(random.randint(0, 4))
        for site in response.xpath("//section[@class='test']/div/ul"):
            item = {
                'name': site.xpath("//li[@class='centr']//h2/span/text()").extract_first()
            }
            yield item
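If you want the header applied to every request without writing `start_requests` by hand, a minimal sketch using Scrapy's standard `USER_AGENT` setting (the UA string below is just an example browser string, not anything specific to this site):

```python
# settings.py -- set a browser-like User-Agent that Scrapy's default
# UserAgentMiddleware will attach to every outgoing request.
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/81.0.4044.122 Safari/537.36')
```

Note that scrapy-user-agents replaces this middleware with its own random rotation, so a fixed `USER_AGENT` only takes effect once that middleware is disabled.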

Hope this helps.

Comments:

- Do you have any specific questions?
- But I used the scrapy-user-agents middleware, which I thought does the same thing. Your solution above does not work on my side.
- Some other package may be interfering with this. Can you revert your settings and middlewares and then try this? Is the result still the same?
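One more detail worth checking: `handle_httpstatus_list = [403, 504]` in the spider tells Scrapy to deliver those responses straight to `parse()` instead of letting the built-in RetryMiddleware retry them (504 is in the retry list by default). A sketch of retry settings, assuming you remove 504 from `handle_httpstatus_list` (the retry count here is illustrative):

```python
# settings.py -- let Scrapy's RetryMiddleware handle transient 504s.
# Note: keeping 504 in handle_httpstatus_list bypasses retries entirely,
# because the response is handed to the spider instead.
RETRY_ENABLED = True
RETRY_TIMES = 5  # retry each failing request up to 5 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```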