Python tesseract not picking up characters on the right side of the page


While looping through PDF pages, tesseract recognizes the characters on one page, like:

Table 1 Summary Data                    3
Table 2 Unique  Data                    5

but on another page

Table 3  Reservoir Data                 8
Table 4  Surface Data                   9

it drops the trailing numbers, so the output is

Table 3  Reservoir Data                
Table 4  Surface Data

The digits 8 and 9 are not recognized. I checked the images created by pdf2image:

pages = convert_from_path(pdf_path, 500)

The rightmost text is present in the page image.

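However, the DataFrame (df) in the code below contains none of the rightmost data for the page in question, nor anything that looks like misrecognized characters. The PDF pages and images are of the same quality, and the rightmost characters sit at the same horizontal position on both pages.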

This is the code I am using:

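    import pandas as pd
    import pytesseract
    from pdf2image import convert_from_path
    from pytesseract import Output

    custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
    # pdfs: list of PDF file paths, defined elsewhere
    for pdf_path in pdfs:
        # Render each PDF page to a PIL image at 500 dpi.
        pages = convert_from_path(pdf_path, 500)

        for pageNum, imgBlob in enumerate(pages):
            # Only the problem page (index 6) is inspected here.
            if pageNum == 6:
                d = pytesseract.image_to_data(imgBlob, config=custom_config, output_type=Output.DICT)
                df = pd.DataFrame(d)

                print(pageNum)
                print(df)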
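
One way to confirm the missing data from the OCR output rather than by eye is to list the recognized words whose left edge falls in the right-hand part of the page. The sketch below assumes the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)`, which carries `left` and `text` keys among others. The helper name `words_right_of` and the 0.8 cutoff are illustrative choices, not part of pytesseract, and the sample values are made up.

```python
def words_right_of(data, page_width, fraction=0.8):
    """Return (left, text) pairs for non-empty words starting right of the cutoff.

    `data` is shaped like the dict produced by pytesseract.image_to_data with
    output_type=Output.DICT; only the 'left' and 'text' keys are used.
    """
    cutoff = page_width * fraction
    hits = []
    for left, text in zip(data["left"], data["text"]):
        # Skip empty entries (block/line rows) and words left of the cutoff.
        if text.strip() and left >= cutoff:
            hits.append((left, text))
    return hits


# Hand-made sample shaped like image_to_data output (values are invented).
sample = {
    "left": [100, 400, 3900, 0],
    "text": ["Table", "Data", "8", ""],
}
print(words_right_of(sample, page_width=4000))
```

With a real page, `page_width` would be `imgBlob.width`, since pdf2image returns PIL images. An empty result for the affected page would mean the trailing numbers never reached the OCR output at all.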


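I wondered whether there is a horizontal limit or margin beyond which tesseract cannot read, and changed the dpi to 400 (I assume the 500 is the dpi). Googling terms such as "cut off", "margin", or "skip" turned up nothing relevant.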
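
One way to probe the margin hypothesis directly is to crop the right-hand strip of the page image and run OCR on that strip alone. If the digits are recognized there, a hard horizontal limit seems unlikely and layout analysis becomes the more probable cause. A minimal sketch, assuming the pages are PIL images as returned by pdf2image; the helper `right_strip_box` and the 25% strip width are hypothetical choices.

```python
def right_strip_box(width, height, fraction=0.25):
    """Return a (left, upper, right, lower) box covering the rightmost
    `fraction` of an image, in the order PIL's Image.crop expects."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    left = int(width * (1 - fraction))
    return (left, 0, width, height)


print(right_strip_box(4000, 5000))

# Usage with a real page (requires Pillow and pytesseract, not run here):
#   strip = imgBlob.crop(right_strip_box(imgBlob.width, imgBlob.height))
#   print(pytesseract.image_to_string(strip, config="--psm 6"))
```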

Check whether a different page segmentation mode gives better results:

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 6 -l eng+ita'

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

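I ran into the same problem on Tesseract 4, and @K41F4r's solution worked for me with page segmentation mode 12 (sparse text with OSD).

This is a page segmentation mode issue: psm 3 fails to detect sparse characters in the image. Use psm 6, 11, or 12.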
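
The suggestions above can be compared side by side by running the same page through several segmentation modes. This is only a sketch: `build_config` and `compare_psm_modes` are hypothetical helpers, the config string mirrors the one in the question, and a stand-in OCR function is used so the example runs without Tesseract installed.

```python
def build_config(psm, langs="eng+ita", oem=1):
    """Build a Tesseract config string like the one used in the question."""
    return f"-c preserve_interword_spaces=1 --oem {oem} --psm {psm} -l {langs}"


def compare_psm_modes(image, modes=(1, 3, 6, 11, 12), ocr=None):
    """Run OCR once per page segmentation mode and return {psm: text}.

    `ocr` is any callable taking (image, config=...). It defaults to
    pytesseract.image_to_string, imported lazily so the helper can be
    exercised without Tesseract installed.
    """
    if ocr is None:
        import pytesseract
        ocr = pytesseract.image_to_string
    return {psm: ocr(image, config=build_config(psm)) for psm in modes}


# Stand-in OCR function (an assumption for this demo, not a pytesseract API).
def fake_ocr(image, config=""):
    return f"ran with: {config}"


results = compare_psm_modes(image=None, modes=(6, 12), ocr=fake_ocr)
for psm, text in results.items():
    print(psm, "->", text)
```

On a real page, calling `compare_psm_modes(imgBlob)` without the `ocr` argument would use pytesseract, and checking which outputs contain the trailing 8 and 9 shows which modes pick up the right-hand column.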