Extracting table data with Tesseract in Python [OCR]

Tags: python, opencv, ocr, tesseract, data-extraction

Background of the problem

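Hey, guys! I am developing an OCR implementation for someone, with 4 templates in total. I got three of them working without any trouble, but this one template really has me stuck. I tried everything: deskew, opencv, homography, and other tools to fix the alignment. I even tried a lot of other methods, such as dilate and erode. The dataset consists of PDF files, which I convert to jpeg files in order to extract the required text.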

Problem

These are some images of the template I need to run OCR on :)

Output [using plain Tesseract], alignment script, pre-processing script

The best way to go

I tried a lot of ways to solve this problem, but it can be done with one simple command and a couple of parameters:
deskew --output input.png input.jpg

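Result [spectacular]

A simple extraction method

Workflow
Write a function that detects the 13-digit barcodes by splitting each line, and saves each split line in a list.
Then create a function that removes "" and "|" by calling barcode_list.remove("|") and barcode_list.remove("").
Perform simple list operations on each row: barcode_row[-1] is the total and, similarly, barcode_row[1] is our barcode.
Save the result in a .csv file or a .sqlite database.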
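The workflow above could be sketched roughly as follows. This is only a minimal sketch, not the original author's code: the names `parse_row`, `extract` and `save_csv`, the CSV column headers, and splitting on single spaces are all assumptions, and the sample line is copied from the Tesseract output in this post. Because `list.remove()` drops only the first match, the sketch filters out every "|" and "" token with a list comprehension.

```python
import csv

def parse_row(line):
    """Split one OCR line into tokens and drop the '|' and '' separators."""
    barcode_row = line.strip().split(" ")
    # Tesseract glues the leading '|' onto the first token, so strip it off.
    barcode_row[0] = barcode_row[0].lstrip("|")
    # Remove every stray '|' and empty token.
    return [tok for tok in barcode_row if tok not in ("|", "")]

def extract(lines):
    """Return (barcode, total) pairs for rows whose second field is a 13-digit number."""
    results = []
    for line in lines:
        row = parse_row(line)
        if len(row) > 2 and len(row[1]) == 13 and row[1].isdigit():
            # row[1] is the barcode, row[-1] is the total, as in the workflow.
            results.append((row[1], row[-1]))
    return results

def save_csv(pairs, path):
    """Write the (barcode, total) pairs to a CSV file with a header row."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["barcode", "total"])  # hypothetical column names
        writer.writerows(pairs)

sample = [
    "|401078 | 6161108006029 | BIO WHOLE FRESH MILK 2L 1 PCS 60.00 PCS 251.10 15,066.00",
]
pairs = extract(sample)
print(pairs)
```

Calling `save_csv(pairs, "barcodes.csv")` afterwards would write the pairs to disk; the filename is just a placeholder.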

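And there it is: the best way to solve a complex problem.

Fun fact: while doing this I noticed something that a lot of newcomers will notice too. A badly rotated image that has been pre-processed gives better results than the aligned but un-pre-processed image you see here. That just goes to show one thing: damn, pre-processing is OP!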

import cv2
import numpy as np
import math
import sys
from PIL import Image
# Get the image files from the command line arguments
# These are full paths to the images
# image2 will be warped to match image1
# argv[0] is name of script
image1 = sys.argv[1]
image2 = sys.argv[2]
outfile = sys.argv[3]

# Read the images to be aligned
# im2 is to be warped to match im1
im1 =  cv2.imread(image1)
im2 =  cv2.imread(image2)

# Convert images to grayscale for computing the rotation via ECC method
im1_gray = cv2.cvtColor(im1,cv2.COLOR_BGR2GRAY)
im2_gray = cv2.cvtColor(im2,cv2.COLOR_BGR2GRAY)

# Find size of image1
sz = im1.shape

# Define the motion model - euclidean is rigid (SRT)
warp_mode = cv2.MOTION_EUCLIDEAN

# Define 2x3 matrix and initialize the matrix to identity matrix I (eye)
warp_matrix = np.eye(2, 3, dtype=np.float32)

# Specify the number of iterations.
number_of_iterations = 5000

# Specify the threshold of the increment
# in the correlation coefficient between two iterations
termination_eps = 1e-3

# Define termination criteria
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, number_of_iterations,  termination_eps)

# Run the ECC algorithm. The results are stored in warp_matrix.
(cc, warp_matrix) = cv2.findTransformECC(im1_gray, im2_gray, warp_matrix, warp_mode, criteria)

# Warp im2 using affine
im2_aligned = cv2.warpAffine(im2, warp_matrix, (sz[1],sz[0]), flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)

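# Save the aligned image to the output path given on the command line
cv2.imwrite(outfile, im2_aligned)

# Print the rotation angle estimated from the off-diagonal term of the warp matrix
row0_col1 = warp_matrix[0,1]
angle = math.degrees(math.asin(row0_col1))
print(angle)

# Open the sample image (hardcoded path)
colorImage = Image.open("./0.jpg")

# Rotate it by the estimated angle (kept as a float so small skews are not truncated to 0)
rotated = colorImage.rotate(angle)
rotated.show()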

# Automatic Alignment using Homography and Pre-Processing :)
# IN CASE DESKEW FAILS
import cv2
import numpy as np
import sys
import os
from colorama import Fore, Back
import time

def pre_processing(img, out_img):
    """Denoise, erode and binarize the image at path img, then write it to out_img."""
    print(Fore.RED)
    os.system('pyfiglet pre-pro -f poison')  # decorative banner only
    print(Fore.GREEN+Back.RESET+"[+] Pre-Processing Toolkit by Muneeb Ahmad")
    print('')
    print(Fore.GREEN + '[+] STARTING OPENCV IMAGE PRE-PROCESSING [!]')
    print(Fore.RED +'[+] APPLYING BINARIZATION [METHOD:OTSU] [!]')
    print(Fore.GREEN+Back.RESET)
    img = cv2.imread(filename=img)
    img = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
    imgfinal = cv2.fastNlMeansDenoising(img)
    # Note: with a 1x1 kernel, erode leaves the image unchanged; use a larger kernel for real erosion
    kernel = np.ones((1, 1), np.uint8)
    imgfinal = cv2.erode(imgfinal,kernel=kernel,iterations=2)
    # Binarize the grayscale image with Otsu's threshold
    _, imgfinal = cv2.threshold(imgfinal, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Write the result to the requested output path
    cv2.imwrite(filename=out_img,img=imgfinal)

if __name__ == '__main__':
    pre_processing('0.jpg', 'output.png')

deskew --output input.png input.jpg

run the pre_processing function ...

tesseract input.png input.txt --psm 6


|401078 | 6161108006029 | BIO WHOLE FRESH MILK 2L 1 PCS 60.00 PCS 251.10 15,066.00
|400242 | 6161108006012 | BIO WHOLE MILK 1LTR 1 PCS 24.00 PCS 130.50 3,132.00
|400985 | 6161108000812 | BIO WHOLE MILK LONG LIFE S5OOML 1 PCS 24.00 PCS 67.40 1,617.58
|400833 | 6161108005039 | BIO YOG NATURE 450ML 1 PCS 12.00 PCS 266.37 3,196.45
|400700 | 6161108000027 | BIO YOG NATURE PLAIN 150ML 1 PCS 24.00 PCS 91.01 2,184.34
|400364 | 6161108000058 | BIO YOG S/BERRY 150ML CUP 1 PCS 36.00 PCS 91.02 3,276.54
|400365 | 6161108000119 | BIO YOG VANILLA 15OML CUP 1 PCS 48.00 PCS 91.02 4,368.72
|400839 | 6161105384663 | DAIMA YOG SLURRP PASSION 1L 1 PCS 3.00 PCS 208.05 624.15
|400828 | 6161108005169 | BIO YOGHURT MANGO 90ML 1 PCS 12.00 PCS 60.96 731.46
|400823 | 6161108005015 | BIO YOGHURT NATURE PLAIN 90ML 1 PCS 24.00 PCS 60.96 1,462.92
|400825 | 6161108005190 | BIO YOGHURT PEACH 90ML 1 PCS 12.00 PCS 60.96 731.46
|400826 | 6161108005046 | BIO YOGHURT STRAWBERRY 90ML 1 PCS 12.00 PCS 60.96 731.46
|400827 | 6161108005121 | BIO YOGHURT VANILLA 450ML 1 PCS 6.00 PCS 266.37 1,598.23
|400822 | 6161108005107 | BIO YOGHURT VANILLA 90ML 1 PCS 12.00 PCS 60.96 731.46
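
Lines like the ones above could also be loaded into a SQLite database. The following is a rough sketch under stated assumptions: the regular expression, the table name `items`, and the column names (`item_code`, `barcode`, `details`, `total`) are hypothetical and only reflect the layout visible in this output, where each row starts with an item code, then a 13-digit barcode, and ends with a total. The two sample lines are copied from the output above.

```python
import re
import sqlite3

# item code | 13-digit barcode | free text ... final amount such as 15,066.00
ROW_RE = re.compile(r"^\|?\s*(\d+)\s*\|\s*(\d{13})\s*\|\s*(.*)\s+([\d,]+\.\d{2})$")

def parse_line(line):
    """Return (item_code, barcode, details, total) or None if the line is not a table row."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    item_code, barcode, details, total = m.groups()
    # Drop the thousands separator before converting the total to a number.
    return item_code, barcode, details, float(total.replace(",", ""))

def store(lines, db_path=":memory:"):
    """Insert all parsable lines into an 'items' table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(item_code TEXT, barcode TEXT, details TEXT, total REAL)"
    )
    rows = [r for r in map(parse_line, lines) if r is not None]
    conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

ocr_lines = [
    "|401078 | 6161108006029 | BIO WHOLE FRESH MILK 2L 1 PCS 60.00 PCS 251.10 15,066.00",
    "|400839 | 6161105384663 | DAIMA YOG SLURRP PASSION 1L 1 PCS 3.00 PCS 208.05 624.15",
]
conn = store(ocr_lines)
for row in conn.execute("SELECT barcode, total FROM items ORDER BY total"):
    print(row)
```

Passing a file path instead of the default `":memory:"` would persist the table to a .sqlite file. Barcodes are stored as text so that leading digits are never lost to numeric conversion.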