pytesseract on a base64 string in Python
Tags: python, python-2.7, base64, python-imaging-library, tesseract
I have a large pile of base64-encoded image strings for PNG images. They are phone numbers (see my working example, which uses the string in the src attribute of an img tag). I want to run them through pytesseract to extract the digits. I got some guidance from another answer and have tried several formulations, but I can't figure out how to load the string into PIL properly so that I can run pytesseract on it. Here is one attempt:
from PIL import Image
import base64
import pytesseract
import cStringIO
imgstring = 'data: image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGcAAAAVCAYAAABbq/AzAAAACXBIWXMAAA7EAAAOxAGVKw4bAAADiUlEQVRoge3YTWgeVRQG4IcQJJQSQihBNIQiXYUSpJSgIF1IkSKllCJFQggutBQRQRfFH3RTRFwVERdBRHcuREVEupASRIP4C7VoFSmWSq2gtRGjNm21Ls79+k0mM3cmLrpxXhjm++aec973/sy55w4dOnTo0KFDhw4FLOIe3IWrmWtbwWcfLuDhinj7avzvTjyLDXp24F2cxx84gQMVdgfxHVbwFe5tqaEObXxynDCAQ/g2af+8hjM3ftcwIzoPQ5iouObEQG1INvP4PvlVBT+IYxVxhlJ7Vad6GMICHsQkNqWO/I7Zgt1s6twejGEvfhMLrI2GKjT5NHHCs6l/dyTtM8nnzkL/msbvGhbFTOdwDIfT7514GyN4ryb403glE+8JvN/AWcY8Xi/8P45HK+J+0FJDFZp8mjgHxRu1o2TzkH62aDN+YFS8tlMZQdP4S6yCMhZqgr8oVlAdtuFvDGdsyngHL6Xfw0L31pq4Qy00VCHn04bzlmSzsWQznmw2lJ7XjZ8BbMclfJ0R/CRexS8ZmzJGcb/+nnFcrJ6B1P4lroiJb8I4nhGD8FzhGZwu2Z5OHJtbaFiv7jacvTG6qWSzKdmMZbjX4D78kGnfisvYUtNeN/PDYr8Yxo2iCDiH5ws2ZxN/HY7ob8insKvQtj09Hyz5DOkXLm00rEd3G04iZR3Vn8xpkdKuWjtptW+O1HAyI/Y1vJFpzwYvYUakx17nTjb4Dor0MIH9YgPtpZxJ1emjl6YnGzTMWF2NzbTQ3ZZzRKTfc6KIOSoKiMu4oeRbO36DWK4g62GLGJTbM8LXg2/EKtuIpXRfzthfSe3LOJPub+EpkUr+ERNXTMkT6X6mQcObuLnw/NcWuttyLuGBUoxZkcovZXhWYQA/6efDMh7Hh/ikbcAGTCe+Jf38++M6/Iur7k98ZnWqk/5/oX7SexouJu7edbGF7v/KOYBH8HKGoxJjqqu1cVES5g5s1L+Wh3GbyNvjYm+5oH+QnEq8oxW+t4rSfkqkiDHsFmltvmC3R2zcu5LdbnHm2N9SQxWafJo4iYwwIhbTtCiXF1S/AI3bwqfWnnNeEAepJuRK6VOi4vlZnGmKK+4QPq6JuVmcZ86KBXI+2R6wtoNz4iS+Ik7txYFv0lCFNj45TuJguqL/ZeMxa/eaHhonZ07/C8H1wonE26EFPhKfSK4H9ia+Dh06dOjw/8G/sXcmUir28IcAAAAASUVORK5CYII='
imgstring = imgstring.split('base64,')[-1].strip()
pic = cStringIO.StringIO()
image_string = cStringIO.StringIO(base64.b64decode(imgstring))
image = Image.open(image_string)
image.save('pic.png', image.format, quality = 100)
picture = Image.open('pic.png', mode='r')
picture.load()
picture.seek(0)
print pytesseract.image_to_string(Image.open(picture))
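It seems to me that I'm doing this the hard way, but even after saving, loading and so on, I still get AttributeError: read.
What is the most efficient way to load these images into memory and let pytesseract chew on them? I haven't even reached the tesseract stage yet, so I don't know how fast or slow it is, but I have millions of these to process.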
Traceback (most recent call last):
File "C:\Users\Jeff\Desktop\QS2\tess.py", line 16, in <module>
print pytesseract.image_to_string(Image.open(picture))
File "C:\Python27\lib\site-packages\PIL\Image.py", line 2223, in open
prefix = fp.read(16)
File "C:\Python27\lib\site-packages\PIL\Image.py", line 605, in __getattr__
raise AttributeError(name)
AttributeError: read
The PNG's transparency seems to be causing the problem. Overlaying the image on a white background fixed it:
from PIL import Image
import base64
import pytesseract
import cStringIO
imgstring = 'data: image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGcAAAAVCAYAAABbq/AzAAAACXBIWXMAAA7EAAAOxAGVKw4bAAADiUlEQVRoge3YTWgeVRQG4IcQJJQSQihBNIQiXYUSpJSgIF1IkSKllCJFQggutBQRQRfFH3RTRFwVERdBRHcuREVEupASRIP4C7VoFSmWSq2gtRGjNm21Ls79+k0mM3cmLrpxXhjm++aec973/sy55w4dOnTo0KFDhw4FLOIe3IWrmWtbwWcfLuDhinj7avzvTjyLDXp24F2cxx84gQMVdgfxHVbwFe5tqaEObXxynDCAQ/g2af+8hjM3ftcwIzoPQ5iouObEQG1INvP4PvlVBT+IYxVxhlJ7Vad6GMICHsQkNqWO/I7Zgt1s6twejGEvfhMLrI2GKjT5NHHCs6l/dyTtM8nnzkL/msbvGhbFTOdwDIfT7514GyN4ryb403glE+8JvN/AWcY8Xi/8P45HK+J+0FJDFZp8mjgHxRu1o2TzkH62aDN+YFS8tlMZQdP4S6yCMhZqgr8oVlAdtuFvDGdsyngHL6Xfw0L31pq4Qy00VCHn04bzlmSzsWQznmw2lJ7XjZ8BbMclfJ0R/CRexS8ZmzJGcb/+nnFcrJ6B1P4lroiJb8I4nhGD8FzhGZwu2Z5OHJtbaFiv7jacvTG6qWSzKdmMZbjX4D78kGnfisvYUtNeN/PDYr8Yxo2iCDiH5ws2ZxN/HY7ob8insKvQtj09Hyz5DOkXLm00rEd3G04iZR3Vn8xpkdKuWjtptW+O1HAyI/Y1vJFpzwYvYUakx17nTjb4Dor0MIH9YgPtpZxJ1emjl6YnGzTMWF2NzbTQ3ZZzRKTfc6KIOSoKiMu4oeRbO36DWK4g62GLGJTbM8LXg2/EKtuIpXRfzthfSe3LOJPub+EpkUr+ERNXTMkT6X6mQcObuLnw/NcWuttyLuGBUoxZkcovZXhWYQA/6efDMh7Hh/ikbcAGTCe+Jf38++M6/Iur7k98ZnWqk/5/oX7SexouJu7edbGF7v/KOYBH8HKGoxJjqqu1cVES5g5s1L+Wh3GbyNvjYm+5oH+QnEq8oxW+t4rSfkqkiDHsFmltvmC3R2zcu5LdbnHm2N9SQxWafJo4iYwwIhbTtCiXF1S/AI3bwqfWnnNeEAepJuRK6VOi4vlZnGmKK+4QPq6JuVmcZ86KBXI+2R6wtoNz4iS+Ik7txYFv0lCFNj45TuJguqL/ZeMxa/eaHhonZ07/C8H1wonE26EFPhKfSK4H9ia+Dh06dOjw/8G/sXcmUir28IcAAAAASUVORK5CYII='
imgstring = imgstring.split('base64,')[-1].strip()
# Decode the base64 payload into an in-memory buffer that PIL can read
image_string = cStringIO.StringIO(base64.b64decode(imgstring))
image = Image.open(image_string)
# Overlay on white background, see http://stackoverflow.com/a/7911663/1703216
bg = Image.new("RGB", image.size, (255,255,255))
bg.paste(image,image)
print pytesseract.image_to_string(bg)
# Save the image passed to pytesseract for debugging purposes
bg.save('pic.png')
For Python 3:
from PIL import Image
import base64
import pytesseract
import io
imgstring = 'data: image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGcAAAAVCAYAAABbq/AzAAAACXBIWXMAAA7EAAAOxAGVKw4bAAADiUlEQVRoge3YTWgeVRQG4IcQJJQSQihBNIQiXYUSpJSgIF1IkSKllCJFQggutBQRQRfFH3RTRFwVERdBRHcuREVEupASRIP4C7VoFSmWSq2gtRGjNm21Ls79+k0mM3cmLrpxXhjm++aec973/sy55w4dOnTo0KFDhw4FLOIe3IWrmWtbwWcfLuDhinj7avzvTjyLDXp24F2cxx84gQMVdgfxHVbwFe5tqaEObXxynDCAQ/g2af+8hjM3ftcwIzoPQ5iouObEQG1INvP4PvlVBT+IYxVxhlJ7Vad6GMICHsQkNqWO/I7Zgt1s6twejGEvfhMLrI2GKjT5NHHCs6l/dyTtM8nnzkL/msbvGhbFTOdwDIfT7514GyN4ryb403glE+8JvN/AWcY8Xi/8P45HK+J+0FJDFZp8mjgHxRu1o2TzkH62aDN+YFS8tlMZQdP4S6yCMhZqgr8oVlAdtuFvDGdsyngHL6Xfw0L31pq4Qy00VCHn04bzlmSzsWQznmw2lJ7XjZ8BbMclfJ0R/CRexS8ZmzJGcb/+nnFcrJ6B1P4lroiJb8I4nhGD8FzhGZwu2Z5OHJtbaFiv7jacvTG6qWSzKdmMZbjX4D78kGnfisvYUtNeN/PDYr8Yxo2iCDiH5ws2ZxN/HY7ob8insKvQtj09Hyz5DOkXLm00rEd3G04iZR3Vn8xpkdKuWjtptW+O1HAyI/Y1vJFpzwYvYUakx17nTjb4Dor0MIH9YgPtpZxJ1emjl6YnGzTMWF2NzbTQ3ZZzRKTfc6KIOSoKiMu4oeRbO36DWK4g62GLGJTbM8LXg2/EKtuIpXRfzthfSe3LOJPub+EpkUr+ERNXTMkT6X6mQcObuLnw/NcWuttyLuGBUoxZkcovZXhWYQA/6efDMh7Hh/ikbcAGTCe+Jf38++M6/Iur7k98ZnWqk/5/oX7SexouJu7edbGF7v/KOYBH8HKGoxJjqqu1cVES5g5s1L+Wh3GbyNvjYm+5oH+QnEq8oxW+t4rSfkqkiDHsFmltvmC3R2zcu5LdbnHm2N9SQxWafJo4iYwwIhbTtCiXF1S/AI3bwqfWnnNeEAepJuRK6VOi4vlZnGmKK+4QPq6JuVmcZ86KBXI+2R6wtoNz4iS+Ik7txYFv0lCFNj45TuJguqL/ZeMxa/eaHhonZ07/C8H1wonE26EFPhKfSK4H9ia+Dh06dOjw/8G/sXcmUir28IcAAAAASUVORK5CYII='
imgstring = imgstring.split('base64,')[-1].strip()
# Decode the base64 payload into an in-memory binary buffer that PIL can read
image_string = io.BytesIO(base64.b64decode(imgstring))
image = Image.open(image_string)
# Overlay on white background, see http://stackoverflow.com/a/7911663/1703216
bg = Image.new("RGB", image.size, (255,255,255))
bg.paste(image,image)
print(pytesseract.image_to_string(bg))
# Save the image passed to pytesseract for debugging purposes
bg.save('pic.png')
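Both versions above strip the data-URI prefix with split('base64,') and decode the remainder. With millions of strings, it may help to check that step on its own before anything reaches PIL or tesseract. The following is a minimal sketch using only the standard library. The helper name decode_png_data_uri and its error messages are made up for illustration and are not part of PIL or pytesseract.

```python
import base64
import binascii

# Every valid PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png_data_uri(data_uri):
    """Return the raw PNG bytes carried by a base64 data URI (hypothetical helper)."""
    # Keep only the payload after the 'base64,' marker, as the answer's code does.
    payload = data_uri.split("base64,")[-1].strip()
    try:
        raw = base64.b64decode(payload)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("payload is not valid base64: %s" % exc)
    # Reject anything that does not look like a PNG before handing it to PIL.
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return raw
```

The returned bytes can then be wrapped in io.BytesIO and passed to Image.open, exactly as in the Python 3 code above.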
Comments:
Can you give us the full traceback? At the moment I have no idea what causes the AttributeError.
Added the traceback. Does that mean the code works for you? I've heard that PIL has problems with its Windows installation, and I'd be interested to know if that is the issue.
Isn't picture already a PIL.Image object here? The argument to Image.open() must be a file object or a string. Maybe change line 16 to print pytesseract.image_to_string(picture).
I tried that first. On that line I get WindowsError: [Error 2] The system cannot find the file specified. The full traceback is too long. The advantage of saving and reopening the file is that I can make sure the file exists and looks fine. The pic.png file really is a PNG of a phone number in the correct directory.
pytesseract writes a temporary image to disk before calling tesseract-ocr. If efficiency is a concern, you may have more luck with a wrapper that claims not to write any temporary files.
Pasting this in exactly, I still get WindowsError: [Error 2] The system cannot find the file specified.
I tested it on Ubuntu. I suggest testing the basic usage examples first. pytesseract is only a thin wrapper around tesseract-ocr, so it cannot work without tesseract-ocr.
I had failed to add the tesseract-ocr program files to the system's PATH variable correctly.
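The WindowsError in this thread came from the tesseract executable not being on PATH. A quick check like the one below can rule that out before debugging PIL. This is only a sketch: find_tesseract is a hypothetical helper name, and it assumes the executable is called tesseract (shutil.which also resolves tesseract.exe on Windows through PATHEXT).

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it or add its folder to PATH")
    else:
        print("tesseract found at", path)
```

If the binary is installed but changing PATH is not an option, pytesseract can also be pointed at it directly by assigning the full path to pytesseract.pytesseract.tesseract_cmd.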