Extracting a captcha image with Python

python python-2.7 xpath

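I am building a training set for a neural network out of the digits extracted from captcha images. I am using Python 2.7.3, the lxml library, and XPath selectors.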

To get the right captcha image I need to extract the img src, which is generated dynamically and is different on every page load, so my Python code is:

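import urllib
from lxml import etree, html

# Base URL that the relative captcha src is appended to
adres_prefix = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"
# Compiled XPath that returns the src attribute of the captcha <img> as a string
adres_sufix = etree.XPath('string(//img[@class="captcha"]/@src)')

# Download the search page (Python 2 urllib API)
sock = urllib.urlopen("https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx")
htmlSource = sock.read()
sock.close()

# Parse the HTML and evaluate the XPath against the parsed tree
root = etree.HTML(htmlSource)
result = etree.tostring(root, pretty_print=True, method="html")  # serialized page, not used below
result2 = adres_sufix(root)
www = adres_prefix + result2
print www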

So every time I get a www value like this:

https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/captcha.ashx?id=1b7d2b6d-70a6-4ce3-bedd-fe89038fb7f3&empty=1

Something is wrong, because when I paste this link into my browser I get nothing.

I don't know what is wrong. Why does the XPath selector pick up "&empty=1"?

Any ideas?

The original HTML source really does contain "empty=1", so your code is correct. To get the image, just trim off the "&empty=1" part.
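The trimming step could look roughly like the sketch below. The helper name `strip_empty_flag` is hypothetical (it is not part of lxml or of the question's code), and the sketch assumes the flag always appears as a trailing "&empty=1", as it does in the URL shown in the question.

```python
def strip_empty_flag(url):
    """Return url without a trailing "&empty=1"; other URLs are returned unchanged."""
    suffix = "&empty=1"  # assumption: the flag is always the last query parameter
    if url.endswith(suffix):
        return url[:-len(suffix)]
    return url


# Example value taken from the question's output
www = ("https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"
       "captcha.ashx?id=1b7d2b6d-70a6-4ce3-bedd-fe89038fb7f3&empty=1")
image_url = strip_empty_flag(www)
print(image_url)
```

In the question's script this would be applied to `www` before the image is requested. If the parameter could appear elsewhere in the query string, parsing the query instead of checking the suffix would be the safer choice.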