CAPTCHA in Python Scrapy

python scrapy

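I'm developing a Scrapy application in which I'm trying to log in to a site through a form that uses a CAPTCHA (it's not spam). I'm using an ImagesPipeline to download the CAPTCHA and print it to the screen for the user to solve. So far so good.

My question is: how can I restart the spider to submit the CAPTCHA/form information? Right now my spider requests the CAPTCHA page and then returns an Item containing the image URL of the CAPTCHA. The Item is then processed/downloaded by the ImagesPipeline and displayed to the user. I'm unclear on how to resume the spider's progress and pass the solved CAPTCHA and the same session back to the spider, since I believe the spider has to return the Item (i.e. quit) before the ImagesPipeline goes to work.

I've looked through the docs and examples, but I haven't found any that make it clear how to accomplish this.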
This is how you might get it to work inside the spider:
self.crawler.engine.pause()
process_my_captcha()
self.crawler.engine.unpause()

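Once you get the response, pause the engine, display the image, read the answer from the user, and resume the crawl by submitting a POST request for the login.
I'd be interested to know whether this approach works for your case.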
I wouldn't create an Item and use the ImagesPipeline.
import os
import subprocess
from urllib.request import urlretrieve

from scrapy import FormRequest, Request

...

# raw string so the backslash in the Windows path is taken literally
CAPTCHA_PATH = r"c:\captcha.jpg"

def start_requests(self):
    request = Request("http://webpagewithcaptchalogin.com/", callback=self.fill_login_form)
    return [request]

def fill_login_form(self, response):
    # take the src of the first image on the page as the captcha
    img_src = response.xpath("//img/@src").get()

    # delete the previous captcha file (if any) and use urllib to write the new one to disk;
    # urljoin() turns a relative src into an absolute URL
    if os.path.exists(CAPTCHA_PATH):
        os.remove(CAPTCHA_PATH)
    urlretrieve(response.urljoin(img_src), CAPTCHA_PATH)

    # I use a program here to show the jpg (actually send it somewhere);
    # check_output() returns bytes, so decode and strip the trailing newline
    captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe").decode().strip()

    # OR just get the input from the user from stdin
    # captcha = input("put captcha in manually>")

    # this FormRequest performs the login and calls process_home_page with
    # the response (this way you can chain pages from start_requests() to parse())
    return [FormRequest.from_response(
        response,
        formnumber=0,
        formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha},
        callback=self.process_home_page,
    )]

def process_home_page(self, response):
    # check if you logged in etc. etc.
    pass

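What I do here is use urlretrieve(url) from urllib (to store the image), os.remove(file) (to delete the previous image), and subprocess.check_output (to call an external command-line utility that solves the CAPTCHA). The whole Scrapy infrastructure is not used in this "hack", because solving a CAPTCHA like this is always a hack.
The whole business of calling an external subprocess could have been done more nicely, but this works.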
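
As one possible way to tidy up the external call, here is a minimal sketch using `subprocess.run` with a timeout and text output. The helper name `run_captcha_solver` and the 60-second default are assumptions made for illustration, and the "solver" in the usage lines is just a stand-in Python one-liner, not the real utility from the answer.

```python
import subprocess
import sys


def run_captcha_solver(command, timeout=60):
    """Run an external solver and return what it printed, without the newline.

    `command` is a list such as ["solver.exe", "captcha.jpg"] (hypothetical).
    Raises CalledProcessError on a non-zero exit code and TimeoutExpired
    if the solver takes longer than `timeout` seconds.
    """
    result = subprocess.run(
        command,
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # fail loudly if the solver reports an error
        timeout=timeout,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a real solver: a Python one-liner that prints a fixed answer.
    fake_solver = [sys.executable, "-c", "print('x7Gk2')"]
    print(run_captcha_solver(fake_solver))
```

Passing the command as a list avoids shell quoting problems, and the timeout keeps the crawl from hanging forever if the solver never returns.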
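On some sites it's not possible to save the CAPTCHA image, and you have to open the page in a browser, call a screen-capture utility, and crop at an exact location to "cut out" the CAPTCHA. Now that is screen scraping.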
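How do I call the ImagesPipeline from inside the spider code?

You can simply grab the image from the page you are parsing in the spider. I haven't tried it with friso's ImagesPipeline idea.

If you want to handle it manually:
def parse(self, response):
    self.crawler.engine.pause()
    captcha_var = input("captcha:")
    self.crawler.engine.unpause()
    return scrapy.FormRequest.from_response(response, formdata={'codeTextBox': captcha_var}, callback=self.after_login)

def after_login(self, response):
    print(response.body)
    return

Can you tell us how you did this: "I am printing it to the screen for the user to solve"?

Oh, I used ASCII art, using different letters for the darkness of a given pixel area. I had to tweak it to get it to work.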
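
The ASCII-art idea from the last comment might look something like the sketch below. It is not the commenter's actual code: the character ramp, the function name `to_ascii`, and the input format (a list of rows of grayscale values from 0 for black to 255 for white) are all assumptions. Each output character stands for the average brightness of one block of pixels.

```python
# Characters ordered from darkest to lightest (an arbitrary choice).
RAMP = "@%#*+=-:. "


def to_ascii(pixels, block_w=1, block_h=1, ramp=RAMP):
    """Render a grayscale pixel grid as text, one character per pixel block."""
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    lines = []
    for top in range(0, height, block_h):
        chars = []
        for left in range(0, width, block_w):
            # Collect the pixels of this block (blocks at the edges may be smaller).
            cell = [
                pixels[y][x]
                for y in range(top, min(top + block_h, height))
                for x in range(left, min(left + block_w, width))
            ]
            mean = sum(cell) / len(cell)
            # Map the mean brightness 0..255 onto an index into the ramp.
            index = int(mean * (len(ramp) - 1) / 255)
            index = max(0, min(len(ramp) - 1, index))
            chars.append(ramp[index])
        lines.append("".join(chars))
    return "\n".join(lines)


if __name__ == "__main__":
    # A tiny 2x4 image: dark left half, light right half.
    image = [
        [0, 0, 255, 255],
        [0, 0, 255, 255],
    ]
    print(to_ascii(image, block_w=2, block_h=2))
```

Characters in a terminal are taller than they are wide, so in practice the block height usually needs to be about twice the block width for the output to keep its proportions, which may be the kind of tweaking the comment refers to.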