Image to text: removing non-ASCII characters in Python 2.7

Tags: python, image-processing, ocr, tesseract, python-tesseract

I am using pytesser to run OCR on small images and get a string back from them:

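# Assumed imports (not shown in the original post)
from PIL import Image
from pytesser import image_to_string
image = Image.open(ImagePath)
text = image_to_string(image)
print text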

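However, pytesser sometimes recognizes and returns non-ASCII characters. The problem shows up when I try to print what was just recognized: in Python 2.7 (which I am using), the program crashes.

Is there a way to make pytesser not return any non-ASCII characters? Perhaps there is something that can be changed in Tesseract OCR itself?

Alternatively, is there a way to test a string for non-ASCII characters (without crashing the program) and then skip printing that line?

Some will suggest using Python 3.4, but from my research pytesser does not seem to work with it.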

I would go with unidecode. This library converts non-ASCII characters to their most similar ASCII representation.

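# Import the function itself; calling the module raises
# TypeError: 'module' object is not callable
from unidecode import unidecode
image = Image.open(ImagePath)
text = image_to_string(image)
print unidecode(text)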

It should work fine.
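Where unidecode cannot be installed, a rough stdlib-only approximation is possible. The sketch below is not what unidecode does internally: it only handles characters that decompose into an ASCII base letter plus combining marks (for example accented Latin letters) and silently drops everything else. The helper name `to_ascii_approx` is hypothetical, and the assumption that byte-string input is UTF-8 encoded is mine, not something the thread states. It uses `print_function` so that it runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata


def to_ascii_approx(text):
    """Hypothetical helper: approximate text with ASCII using only the stdlib."""
    if isinstance(text, bytes):
        # Assumption: byte strings coming from the OCR step are UTF-8 encoded.
        text = text.decode("utf-8", "replace")
    # NFKD splits e.g. U+00E9 into "e" followed by a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" drops the combining marks and any other
    # character that has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_approx(u"caf\u00e9 na\u00efve"))  # prints: cafe naive
```

Characters with no such decomposition (CJK, many symbols) simply disappear, so this is lossier than a real transliteration table.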

"Is there a way to make pytesser not return any non-ASCII characters?"

You can use the tessedit_char_whitelist option to restrict the characters that Tesseract is allowed to recognize.

For example:

import string
# Allow ASCII digits and letters only; "_", "-" and "." are appended below
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
image = Image.open(ImagePath)
text = image_to_string(image,
    config="-c tessedit_char_whitelist=%s_-." % char_whitelist)
print text
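To make the whitelist easier to inspect, the sketch below builds the same config string as the example above and adds a sanity check that reports any characters in the OCR output that fall outside the whitelist. The names `ALLOWED`, `build_config` and `unexpected_chars` are hypothetical and not part of any OCR library, and the check assumes the text is a unicode string rather than raw bytes.

```python
from __future__ import print_function
import string

# Same character set as the example above: digits, letters, "_", "-", "."
ALLOWED = string.digits + string.ascii_lowercase + string.ascii_uppercase + "_-."


def build_config(allowed=ALLOWED):
    """Build the config string that sets tessedit_char_whitelist."""
    return "-c tessedit_char_whitelist=%s" % allowed


def unexpected_chars(text, allowed=ALLOWED):
    """Return the sorted non-whitespace characters of text not in allowed."""
    return sorted(set(ch for ch in text
                      if ch not in allowed and not ch.isspace()))


print(build_config())
print(unexpected_chars(u"file_01.txt"))  # prints: []
```

An empty list means every recognized character was on the whitelist; anything else can be logged or filtered before printing.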

See also:

Alternatively, if you want to remove the Unicode characters, you can follow this post:

It was giving a TypeError: 'module' object is not callable. I made a small change: from unidecode import unidecode
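For the other option raised in the question, testing a string for non-ASCII characters and then skipping or cleaning that line, a minimal sketch using only built-ins might look like this. The helper names `is_ascii` and `strip_non_ascii` are hypothetical. The code assumes the input is a text string (`str` or `unicode` on Python 2.7, `str` on Python 3), not Python 3 `bytes`.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def is_ascii(text):
    """Return True if every character in text has a code point below 128."""
    return all(ord(ch) < 128 for ch in text)


def strip_non_ascii(text):
    """Return text with every character outside the ASCII range removed."""
    # text[:0] is an empty string of the same type as text, so the
    # result keeps the input type on Python 2 (str or unicode).
    return text[:0].join(ch for ch in text if ord(ch) < 128)


lines = [u"plain text", u"caf\u00e9"]
for line in lines:
    if is_ascii(line):
        print(line)
    else:
        # Either skip the line entirely or print a cleaned copy.
        print(strip_non_ascii(line))
```

Unlike transliteration, this discards the offending characters outright, so "café" becomes "caf" rather than "cafe".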