Python 504 Gateway Timeout when using scrapy-proxy-pool and scrapy-user-agents
I cannot scrape the data; every request fails with a 504 Gateway Timeout error. I tried to bypass it with both a random User-Agent and a proxy, but neither helped. For the proxy approach I used scrapy-proxy-pool, and for the user-agent approach I used scrapy-user-agents, but even with both in place I still get a 504 Gateway Timeout.

My spider:
import scrapy
import time
import random

class LaughfactorySpider(scrapy.Spider):
    handle_httpstatus_list = [403, 504]
    name = "myspider"
    start_urls = ["mywebsitewebsite"]

    def parse(self, response):
        time.sleep(random.randint(0, 4))
        for site in response.xpath("//section[@class='test']/div/ul"):
            item = {
                'name': site.xpath("//li[@class='centr']//h2/span/text()").extract_first()
            }
            yield item
settings.py:

###### For dynamic proxy
ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'formsubmit_getresult.pipelines.FormsubmitGetresultPipeline': 300,
}

# To enable the proxy
PROXY_POOL_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}

####### For dynamic User-Agent middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
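As an aside, Scrapy's built-in RetryMiddleware already retries 504 responses out of the box, so its settings are worth checking alongside the proxy and user-agent middlewares. A minimal sketch of the relevant knobs (the values here are illustrative, not recommendations):

```python
# settings.py -- tune Scrapy's built-in retry behaviour for timeouts and 5xx responses
RETRY_ENABLED = True
RETRY_TIMES = 5          # extra attempts per request on top of the first one (default is 2)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]  # 504 is in the default list too
DOWNLOAD_TIMEOUT = 180   # seconds to wait for a response; lower it to fail faster
```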
You are not setting the User-Agent header correctly, which is why the website gives you a 504. You need to add the User-Agent header to the first request and to all subsequent requests.

Try it like this:
import time
import random

import scrapy
from scrapy import Request

class LaughfactorySpider(scrapy.Spider):
    handle_httpstatus_list = [403, 504]
    name = "myspider"
    start_urls = ["mywebsitewebsite"]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    }

    def start_requests(self):
        yield Request(self.start_urls[0], headers=self.headers)

    def parse(self, response):
        time.sleep(random.randint(0, 4))
        for site in response.xpath("//section[@class='test']/div/ul"):
            item = {
                'name': site.xpath("//li[@class='centr']//h2/span/text()").extract_first()
            }
            yield item
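Note that in the snippet above only the initial request carries the custom header; any further requests yielded from parse would fall back to the default user agent. One simple way to cover every request the project makes, assuming the random user-agent middleware has been reverted, is to set the agent once in settings.py:

```python
# settings.py -- a single User-Agent applied to every request by the default
# UserAgentMiddleware (only takes effect if no other middleware overrides it)
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/81.0.4044.122 Safari/537.36'
)
```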
Hope that helps. Do you have any specific questions?

— But I used the scrapy-user-agents middleware, which I thought does the same thing. Your solution above does not work on my side.

— Another package may be interfering with it. Can you revert your settings and middlewares and then try this? Is the result still the same?