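How do I control the request flow in scrapy while using the deathbycaptcha service to handle Google recaptcha v2?
Hello :) I am using Python with the scrapy web-crawling framework to scrape a website, and the deathbycaptcha service to solve the captchas I run into on its pages. My download delay is set to 30 seconds and I only need to scrape a few pages for some basic information, so I'm not putting much load on the site's bandwidth. I treat the scrape like a normal browsing session.
So first, the issues.
Issue 1 (code)
How can I make scrapy stop creating new requests, or at least stop interfering with the captcha, while the captcha is being solved? I have tried a lot of different approaches and none of them worked. I'm fairly new to scrapy, so I'm not comfortable editing downloader middlewares or the scrapy engine code. If that is the only way, so be it, but I'm hoping for a simple and effective solution that lets the captcha solver finish without new requests interrupting it.
Issue 2 (code)
How can I fix this timer function? I think it is somewhat related to the first issue. If the captcha times out without being solved, the captchaIsRunning boolean is never reset, and the captcha solver stays blocked from ever trying again. The timer was one of my attempts at solving the first issue, but... I get an error. I'm not sure whether it has to do with importing from both threading and timeit in the import statements, but I don't think that makes much difference. Can anyone point me in the right direction for fixing the timer statement?
As I said, the deathbycaptcha API works well when it gets a chance to run, but the scrapy requests really do interfere, and I haven't found a relevant solution to this. Again, I'm not a scrapy expert, so some of this is well outside my comfort zone. That zone needed a push, just not one so hard that I end up breaking everything xD Thanks for your help, it is much appreciated! Sorry for the super long question.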
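Anyway, the page lets you look up a couple of results, and after about 40-60 pages it redirects to a captcha page containing recaptcha v2. The deathbycaptcha service has an API for solving recaptcha v2, but unfortunately its solve times can sometimes exceed a few minutes, which is disappointing, but it happens. So naturally I raised my DOWNLOAD_TIMEOUT setting to 240 seconds, so that it has enough time to solve the captcha and then carry on scraping without being redirected again. My scrapy settings are as follows: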
CONCURRENT_REQUESTS = 1
DEPTH_LIMIT = 1
DOWNLOAD_DELAY = 30
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
DOWNLOAD_TIMEOUT = 240
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 10
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
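There are others of course, but I think these are the most relevant to my problem. I have one extension enabled, and there is some extra stuff in the middlewares because I'm also using docker and scrapy-splash in this project.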
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
MYEXT_ENABLED = False
MYEXT_ITEMCOUNT = 100
EXTENSIONS = {
'scrapy.extensions.telnet.TelnetConsole': None,
'scrapy.extensions.spideroclog.SpiderOpenCloseLogging':500,
}
So I don't think any of this has much effect on the captcha or the downloader middleware... but here is some of the code from my scraper:
Python:
import sys
import os
sys.path.append(r'F:\Documents\ScrapyDirectory\scrapername\scrapername\spiders')
import deathbycaptcha
import json
import scrapy
import requests
from datetime import datetime
import math
import urllib
import time
from scrapy_splash import SplashRequest
from threading import Timer
from timeit import Timer

class scrapername(scrapy.Spider):
    name = "scrapername"
    start_urls = []

    global scrapeUrlList
    global charCompStorage
    global captchaIsRunning

    r = requests.get('http://example.com/examplejsonfeed.php')
    myObject = json.loads(r.text)

    #print("Loading names...")
    for o in myObject['objects']:
        #a huge function for creating basically a lot of objects and appending links created from these objects to the scrapeUrlList function

    print(len(scrapeUrlList))

    for url in scrapeUrlList:
        start_urls.append(url[1])
        #add all those urls that just got created to the start_urls list

    link_collection = []

    def resetCaptchaInformation():
        global captchaIsRunning
        if captchaIsRunning:
            captchaIsRunning = False

    def afterCaptchaSubmit(self, response):
        global captchaIsRunning
        print("Captcha submitted: " + response.request.url)
        captchaIsRunning = False

    def parse(self, response):
        global captchaIsRunning
        self.logger.info("got response %s for %r" % (response.status, response.url))

        if "InternalCaptcha" in response.request.url:
            #checks for captcha in the url and if it's there it starts running the captcha solver API
            if not captchaIsRunning:
                #I have this statement here as a deterrent to prevent the captcha solver from starting again and again and
                #again with every new request (which it does) *ISSUE 1*
                if "captchasubmit" in response.request.url:
                    print("Found captcha submit in url")
                else:
                    print("Internal Captcha is activated")
                    captchaIsRunning = True
                    t = Timer(240.0, self.resetCaptchaInformation)
                    #so I have been having major issues here not sure why?
                    #*ISSUE 2*
                    t.start()
                    username = "username"
                    password = "password"
                    print("Set username and password")
                    Captcha_dict = {
                        'googlekey': '6LcMUhgUAAAAAPn2MfvqN9KYxj7KVut-oCG2oCoK',
                        'pageurl': response.request.url}
                    print("Created catpcha dict")
                    json_Captcha = json.dumps(Captcha_dict)
                    print("json.dumps on captcha dict:")
                    print(json_Captcha)
                    client = deathbycaptcha.SocketClient(username, password)
                    print("Set up client with deathbycaptcha socket client")
                    try:
                        print("Trying to solve captcha")
                        balance = client.get_balance()
                        print("Remaining Balance: " + str(balance))
                        # Put your CAPTCHA type and Json payload here:
                        captcha = client.decode(type=4,token_params=json_Captcha)
                        if captcha:
                            # The CAPTCHA was solved; captcha["captcha"] item holds its
                            # numeric ID, and captcha["text"] item its a text token".
                            print("CAPTCHA %s solved: %s" % (captcha["captcha"], captcha["text"]))
                            data = {
                                'g-recaptcha-response':captcha["text"],
                            }
                            try:
                                dest = response.xpath("/html/body/form/@action").extract_first()
                                print("Form URL: " + dest)
                                submitURL = "https://exampleaddress.com" + dest
                                yield scrapy.FormRequest(url=submitURL, formdata=data, callback=self.afterCaptchaSubmit, dont_filter = True)
                                print("Yielded form request")
                                if '': # check if the CAPTCHA was incorrectly solved
                                    client.report(captcha["captcha"])
                            except TypeError:
                                sys.exit()
                    except deathbycaptcha.AccessDeniedException:
                        # Access to DBC API denied, check your credentials and/or balance
                        print("error: Access to DBC API denied, check your credentials and/or balance")
            else:
                pass
        else:
            print("no Captcha")
            #this will run if no captcha is on the page that the redirect landed on
            #and basically parses all the information on the page
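Sorry for all the code, and thank you for your patience in reading it. If you have questions about why something is the way it is, just ask and I can explain. So the captcha does get solved. That is not the problem. While the scraper is running there are many requests in progress. One runs into a 302 redirect, then gets a 200 response and crawls the page, detects the captcha and starts solving it. Then scrapy sends another request, which gets a 302 redirect to the captcha page and a 200 response, detects the captcha and tries to solve it again. It starts the API multiple times and wastes my tokens. The if not captchaIsRunning: statement is there to stop that from happening. This is the scrapy log output I currently see when it hits the captcha; keep in mind that everything before this point works fine and runs through all of my parse logging.
Scrapy log: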
2018-07-19 14:10:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dThomas%2520Garrett%26citystatezip%3dLas%2520Vegas%2c%2520Nv> from <GET https://www.exampleaddress.com/results?name=Thomas%20Garrett&citystatezip=Las%20Vegas,%20Nv>
2018-07-19 14:10:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dThomas%2520Garrett%26citystatezip%3dLas%2520Vegas%2c%2520Nv> (referer: None)
2018-07-19 14:10:49 [scrapername] INFO: got response 200 for 'https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dThomas%2520Garrett%26citystatezip%3dLas%2520Vegas%2c%2520Nv'
Internal Captcha is activated
2018-07-19 14:10:49 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dThomas%2520Garrett%26citystatezip%3dLas%2520Vegas%2c%2520Nv> (referer: None)
Traceback (most recent call last):
File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy_splash\middleware.py", line 156, in process_spider_output
for el in result:
File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "F:\Program Files (x86)\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "F:\Documents\ScrapyDirectory\scraperName\scraperName\spiders\scraperName- Copy.py", line 232, in parse
t = Timer(240.0, self.resetCaptchaInformation)
File "F:\Program Files (x86)\Anaconda3\lib\timeit.py", line 130, in __init__
raise ValueError("stmt is neither a string nor callable")
ValueError: stmt is neither a string nor callable
2018-07-19 14:10:53 [scrapy.extensions.logstats] INFO: Crawled 63 pages (at 2 pages/min), scraped 13 items (at 0 items/min)
2018-07-19 14:11:02 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dSamuel%2520Van%2520Cleave%26citystatezip%3dLas%2520Vegas%2c%2520Nv> from <GET https://www.exampleaddress.com/results?name=Samuel%20Van%20Cleave&citystatezip=Las%20Vegas,%20Nv>
2018-07-19 14:11:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dSamuel%2520Van%2520Cleave%26citystatezip%3dLas%2520Vegas%2c%2520Nv> (referer: None)
2018-07-19 14:11:13 [scrapername] INFO: got response 200 for 'https://www.exampleaddress.com/InternalCaptcha?returnUrl=%2fresults%3fname%3dSamuel%2520Van%2520Cleave%26citystatezip%3dLas%2520Vegas%2c%2520Nv'
#and then an endless supply of 302 redirects, and 200 response for their crawl
#nothing happens, because the Timer failed, the captcha never solved?
#I'm not sure what is going wrong with it, hence the issues I am having
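A note on the traceback: its last frames are inside timeit.py, which suggests that the name Timer resolved to timeit.Timer rather than threading.Timer. The later `from timeit import Timer` rebinds the name bound by `from threading import Timer`, and timeit.Timer takes a statement as its first argument, so 240.0 is rejected as "neither a string nor callable". Separately, resetCaptchaInformation is defined without self, so calling it as a bound method would raise a TypeError once the timer fired. Below is a minimal sketch of a one-shot reset using threading.Timer only. The CaptchaState class and its method names are hypothetical and stand in for the spider; they are not part of scrapy or deathbycaptcha.

```python
import threading


class CaptchaState:
    """Hypothetical holder for the 'captcha is running' flag."""

    def __init__(self):
        self.captcha_is_running = False
        self._timer = None

    def start(self, timeout):
        # Mark the solver as busy and schedule a one-shot reset.
        self.captcha_is_running = True
        self._timer = threading.Timer(timeout, self.reset)
        self._timer.daemon = True  # do not keep the process alive for the timer
        self._timer.start()

    def reset(self):
        # Runs in the timer thread if the solve takes longer than `timeout`.
        self.captcha_is_running = False

    def solved(self):
        # Call when the captcha form has been submitted: cancel the pending reset.
        if self._timer is not None:
            self._timer.cancel()
        self.reset()


state = CaptchaState()
state.start(0.05)     # 240.0 in the question; short here so the demo finishes
state._timer.join()   # wait for the timer thread to fire
print(state.captcha_is_running)
```

In the spider, the same idea would amount to importing Timer from threading only and giving the reset method a self parameter.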
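For issue 1, one option that may be worth trying is pausing the crawl for the duration of the solve. Scrapy's execution engine has pause() and unpause() methods, and a spider can reach the engine as self.crawler.engine. The sketch below wraps that in a context manager so the engine is unpaused even if the solver raises. CaptchaGate and StubEngine are hypothetical names introduced here; StubEngine only exists so the example runs without scrapy installed. This is a sketch under those assumptions, not a tested fix for this spider, and requests that were already handed to the downloader before the pause may still complete.

```python
class CaptchaGate:
    """Hypothetical helper: pause the engine while a captcha is being solved."""

    def __init__(self, engine):
        self.engine = engine
        self.solving = False

    def __enter__(self):
        self.solving = True
        self.engine.pause()      # stop the engine from scheduling further requests
        return self

    def __exit__(self, exc_type, exc, tb):
        self.engine.unpause()    # resume even if the solver raised
        self.solving = False
        return False             # do not swallow exceptions


class StubEngine:
    """Stand-in for scrapy's engine so the sketch is self-contained."""

    def __init__(self):
        self.paused = False

    def pause(self):
        self.paused = True

    def unpause(self):
        self.paused = False


engine = StubEngine()
with CaptchaGate(engine) as gate:
    # In the spider this is where client.decode(...) would run.
    paused_during_solve = engine.paused

print(paused_during_solve, engine.paused)
```

In parse, this would look roughly like `with CaptchaGate(self.crawler.engine): captcha = client.decode(...)`, with the FormRequest yielded after the with block so that it is scheduled once the engine is running again.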