Python 3.x: PyTesseract error with a multi-page TIFF image


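When I read in a 15-page multi-page TIFF image (a document with black letters and words on a white background), PyTesseract throws an "OSError: -9" at the step where I loop over the pages and convert each one to a string.

I am using the pytesseract package and pyocr.builders. Single pages seem to work fine, but I believe the program converts the image to RGB when it is not already in RGB mode.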

img = Image.open(r'\users\ai\text.tiff')
img.load()
txt = ""
for frame in range(0, img.n_frames):
    img.seek(frame)
    txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())

The expected output is the text of all 15 pages displayed in the Jupyter window.

Error message:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-17-e59bdf3b773c> in <module>
      2 for frame in range(0, img.n_frames):
      3     img.seek(frame)
----> 4     txt += tool.image_to_string(img,builder=pyocr.builders.TextBuilder())
      5 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyocr\tesseract.py in image_to_string(image, lang, builder)
    357     with tempfile.TemporaryDirectory() as tmpdir:
    358         if image.mode != "RGB":
--> 359             image = image.convert("RGB")
    360         image.save(os.path.join(tmpdir, "input.bmp"))
    361         (status, errors) = run_tesseract("input.bmp", "output", cwd=tmpdir,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\Image.py in convert(self, mode, matrix, dither, palette, colors)
    932         """
    933 
--> 934         self.load()
    935 
    936         if not mode and self.mode == "P":

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\TiffImagePlugin.py in load(self)
   1097     def load(self):
   1098         if self.use_load_libtiff:
-> 1099             return self._load_libtiff()
   1100         return super(TiffImageFile, self).load()
   1101 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\PIL\TiffImagePlugin.py in _load_libtiff(self)
   1189 
   1190         if err < 0:
-> 1191             raise IOError(err)
   1192 
   1193         return Image.Image.load(self)

OSError: -9
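The traceback shows the failure happening inside `load()`, while the page's pixel data is being decoded, and before Tesseract is ever run. One way to narrow this down is to check each frame on its own. The following is a minimal sketch, not part of the original question: the function name `find_unreadable_frames` is made up for illustration, and it only assumes that the object passed in behaves like a multi-frame PIL image (`n_frames`, `seek()`, `load()`, `mode`).

```python
def find_unreadable_frames(img):
    """Return one (frame_index, mode, error) tuple per frame.

    `error` is None when the frame decodes, otherwise the exception raised.
    `img` is assumed to behave like a multi-frame PIL image.
    """
    report = []
    for frame in range(img.n_frames):
        error = None
        mode = None
        try:
            img.seek(frame)                    # select the page
            mode = getattr(img, "mode", None)  # e.g. "1", "L" or "RGB"
            img.load()                         # force the pixel data to be decoded
        except (OSError, EOFError) as exc:
            error = exc
        report.append((frame, mode, error))
    return report
```

Calling it as `find_unreadable_frames(Image.open(path))` would list, for every page, the image mode and whether decoding failed, which helps tell a single damaged page apart from a problem that affects the whole file.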


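For a question like this you should provide a minimal reproducible example, because some of the code is missing. You should also provide your test image. In this case you cannot attach a multi-page TIFF, so a link to the TIFF would be best.

I was able to find a sample file. It is a 10-page TIFF.

Here is a solution using pyocr: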

from PIL import Image

import pyocr
import pyocr.builders

# pyocr returns the OCR tools it can find; use the first one.
tools = pyocr.get_available_tools()
tool = tools[0]

image = Image.open('multipage_tiff_example.tif')

txt = ''
for frame in range(image.n_frames):
    image.seek(frame)  # select the page to OCR
    txt += tool.image_to_string(image, builder=pyocr.builders.TextBuilder()) + '\n'

print(txt)

Here is a solution using pytesseract:

from PIL import Image
import pytesseract

# On Windows, point pytesseract at the Tesseract binary if it is not on PATH:
# pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

image = Image.open('multipage_tiff_example.tif')

# --psm 6: assume a single uniform block of text
config = '--psm 6'

txt = ''
for frame in range(image.n_frames):
    image.seek(frame)  # select the page to OCR
    txt += pytesseract.image_to_string(image, config=config, lang='eng') + '\n'

print(txt)

Both produce the following output:

Multipage
TIFF
Example
Page 1
Multipage
TIFF
Example
Page 2
Multipage
TIFF
Example
Page 3
Multipage
TIFF
Example
Page 4
Multipage
TIFF
Example
Page5
Multipage
TIFF
Example
Page 6
Multipage
TIFF
Example
Page /
Multipage
TIFF
Example
Page 8
Multipage
TIFF
Example
Page 9
Multipage
TIFF
Example
Page 10
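Both loops above concatenate every page into one string. If the pages need to be told apart later, for example to treat one page differently from the rest, the text can be collected per page instead. This is only a sketch under assumptions: `ocr_pages` is a hypothetical helper name, `img` is assumed to behave like a multi-frame PIL image, and `ocr_page` is any callable that takes the image and returns a string.

```python
def ocr_pages(img, ocr_page):
    """Run `ocr_page` on every frame of `img` and return a list of strings.

    The list index is the zero-based page number, so pages[0] is page 1.
    """
    pages = []
    for frame in range(img.n_frames):
        img.seek(frame)              # select the page
        pages.append(ocr_page(img))  # OCR the currently selected page
    return pages
```

With pytesseract this could be called as `ocr_pages(image, lambda im: pytesseract.image_to_string(im, config='--psm 6', lang='eng'))`, and `'\n'.join(pages)` gives back the combined text.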

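This does work, thank you for your help. One problem I see is that the first page of my TIFF has some signatures and stamps on it, which causes a problem. I need to figure out how to keep that page while retaining only the text.

You could try posting the first TIFF page, along with some code, as a new question tagged opencv. You might get some good ideas there about techniques such as morphology.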
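Morphology here means operations such as erosion and dilation applied to a binarized image. As a rough illustration only, the sketch below performs a 3x3 "opening" (erosion followed by dilation) in plain Python on a small grid of 0s and 1s, where 1 stands for ink. The function names are made up for this example. OpenCV offers the same operation through `cv2.morphologyEx` with `cv2.MORPH_OPEN`. Whether opening actually helps with signatures and stamps depends on the scan, so treat this as a starting point, not a solution.

```python
# All offsets of a 3x3 neighbourhood, including the centre pixel.
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _ink_at(img, y, x):
    """True if (y, x) lies inside the grid and holds a 1."""
    return 0 <= y < len(img) and 0 <= x < len(img[0]) and img[y][x] == 1


def erode(img):
    """Keep a pixel only if its whole 3x3 neighbourhood is ink."""
    return [[int(all(_ink_at(img, y + dy, x + dx) for dy, dx in OFFSETS))
             for x in range(len(img[0]))]
            for y in range(len(img))]


def dilate(img):
    """Set a pixel if any pixel in its 3x3 neighbourhood is ink."""
    return [[int(any(_ink_at(img, y + dy, x + dx) for dy, dx in OFFSETS))
             for x in range(len(img[0]))]
            for y in range(len(img))]


def opening(img):
    """Erosion followed by dilation: removes shapes smaller than 3x3."""
    return dilate(erode(img))
```

On a grid that contains a solid 3x3 block and a single stray pixel, `opening` keeps the block and removes the stray pixel. Thin strokes are removed in the same way, which is why the structuring element size has to be chosen with the text size in mind.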