Extracting table data with Tesseract in Python [OCR]
Background of the problem
Hey guys! I am working on an OCR implementation for someone, and there are 4 templates in total. I can handle three of them without any trouble, but this one template is really getting to me. I have tried everything: deskew, opencv, homography, and other tools to fix the alignment. I have even tried a lot of other methods, such as dilation and erosion. The dataset consists of PDF files, which I convert to jpeg files to extract the required text.
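Problem
These are some images of the template I need to run OCR on :)
Output [using plain Tesseract]
Alignment script
Pre-processing script
The best way is to let go
I tried a lot of ways to solve this problem, but it can be achieved by using deskew with simple parameters: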
deskew --output input.png input.jpg
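Since the dataset is a whole set of converted files, the same command could be driven from Python. This is only a sketch: the helper names `deskew_command` and `deskew_folder` and the folder layout are my own assumptions, and it expects the `deskew` command to be on the PATH.

```python
import subprocess
from pathlib import Path


def deskew_command(src, dst):
    """Build the same command line as above for one image."""
    return ["deskew", "--output", str(dst), str(src)]


def deskew_folder(in_dir, out_dir):
    """Run deskew on every .jpg in in_dir, writing .png files to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.jpg")):
        dst = out_dir / (src.stem + ".png")
        # check=True raises CalledProcessError if deskew exits with an error
        subprocess.run(deskew_command(src, dst), check=True)
```

Calling `deskew_folder("jpegs", "deskewed")` would then process each file in turn and stop at the first one that fails.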
Results [spectacular]
Simple extraction method
Workflow
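1. Write a function that detects the lines containing a 13-digit barcode by splitting each line, and save them in a list.
2. Then write a function that removes "" and "|" by calling barcode_list.remove("|") and barcode_list.remove(""). Note that list.remove deletes only the first match, so call it in a loop or filter the list instead.
3. Do simple list operations on each line: barcode_line[-1] is the total and, similarly, barcode_line[1] is our barcode.
4. Save them in a .csv file or a .sqlite database.
Alignment script:
import cv2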
import numpy as np
import math
import sys
from PIL import Image
# Get the image files from the command line arguments
# These are full paths to the images
# image2 will be warped to match image1
# argv[0] is name of script
image1 = sys.argv[1]
image2 = sys.argv[2]
outfile = sys.argv[3]
# Read the images to be aligned
# im2 is to be warped to match im1
im1 = cv2.imread(image1)
im2 = cv2.imread(image2)
# Convert images to grayscale for computing the rotation via ECC method
im1_gray = cv2.cvtColor(im1,cv2.COLOR_BGR2GRAY)
im2_gray = cv2.cvtColor(im2,cv2.COLOR_BGR2GRAY)
# Find size of image1
sz = im1.shape
# Define the motion model - euclidean is rigid (SRT)
warp_mode = cv2.MOTION_EUCLIDEAN
# Define 2x3 matrix and initialize the matrix to identity matrix I (eye)
warp_matrix = np.eye(2, 3, dtype=np.float32)
# Specify the number of iterations.
number_of_iterations = 5000
# Specify the threshold of the increment
# in the correlation coefficient between two iterations
termination_eps = 1e-3
# Define termination criteria
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, number_of_iterations, termination_eps)
# Run the ECC algorithm. The results are stored in warp_matrix.
(cc, warp_matrix) = cv2.findTransformECC(im1_gray, im2_gray, warp_matrix, warp_mode, criteria)
# Warp im2 using affine
im2_aligned = cv2.warpAffine(im2, warp_matrix, (sz[1],sz[0]), flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
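# Save the aligned image to the output path given on the command line
cv2.imwrite(outfile, im2_aligned)
# Estimate the rotation angle; in a Euclidean warp matrix, element [0, 1] is -sin(theta)
row0_col1 = warp_matrix[0, 1]
angle = math.degrees(math.asin(row0_col1))
print(angle)
colorImage = Image.open("./0.jpg")
# Rotate by the estimated angle; keep it as a float so small skews are not truncated to 0
rotated = colorImage.rotate(angle)
rotated.show()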
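For reference, a Euclidean warp matrix has the form [[cos t, -sin t, tx], [sin t, cos t, ty]], so the rotation can be read back from its entries. The sketch below is just an illustration of that relationship: it uses plain Python lists rather than the NumPy array that cv2.findTransformECC returns, and the helper names `euclidean_warp` and `rotation_degrees` are made up for this example.

```python
import math


def euclidean_warp(theta_deg, tx=0.0, ty=0.0):
    """Build a 2x3 Euclidean (rotation + translation) matrix as nested lists."""
    t = math.radians(theta_deg)
    return [
        [math.cos(t), -math.sin(t), tx],
        [math.sin(t), math.cos(t), ty],
    ]


def rotation_degrees(warp):
    """Recover the rotation angle; atan2 uses both the sin and cos entries."""
    return math.degrees(math.atan2(warp[1][0], warp[0][0]))


warp = euclidean_warp(3.5, tx=12.0, ty=-4.0)
angle = rotation_degrees(warp)

# Element [0][1] holds -sin(t), so asin of it gives the angle with the opposite sign
angle_from_01 = math.degrees(math.asin(warp[0][1]))
```

Which sign you need for the final rotation depends on the direction of the warp and on the rotation convention of the library doing the rotating, so it is worth checking on one known image.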
Pre-processing script:
# Automatic Alignment using Homography and Pre-Processing :)
# IN CASE DESKEW FAILS
import cv2
import numpy as np
import sys
import os
from colorama import Fore, Back
import time


def pre_processing(img, out_img):
    # Print a banner (requires the pyfiglet command-line tool)
    print(Fore.RED)
    os.system('pyfiglet pre-pro -f poison')
    print(Fore.GREEN+Back.RESET+"[+] Pre-Processing Toolkit by Muneeb Ahmad")
    print('')
    print(Fore.GREEN + '[+] STARTING OPENCV IMAGE PRE-PROCESSING [!]')
    print(Fore.RED +'[+] APPLYING BINARIZATION [METHOD:OTSU] [!]')
    print(Fore.GREEN+Back.RESET)
    # Read the input image and convert it to grayscale
    img = cv2.imread(filename=img)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Denoise, then erode (a 1x1 kernel leaves the image unchanged)
    imgfinal = cv2.fastNlMeansDenoising(img)
    kernel = np.ones((1, 1), np.uint8)
    imgfinal = cv2.erode(imgfinal, kernel=kernel, iterations=2)
    # The image is already single-channel here, so it is not converted to grayscale again
    # Write to the path passed in as out_img instead of a hard-coded file name
    cv2.imwrite(filename=out_img, img=imgfinal)


if __name__ == '__main__':
    pre_processing('0.jpg', 'output.png')
Workflow commands:
deskew --output input.png input.jpg
run the pre_processing function ...
tesseract input.png input.txt --psm 6
Output:
|401078 | 6161108006029 | BIO WHOLE FRESH MILK 2L 1 PCS 60.00 PCS 251.10 15,066.00
|400242 | 6161108006012 | BIO WHOLE MILK 1LTR 1 PCS 24.00 PCS 130.50 3,132.00
|400985 | 6161108000812 | BIO WHOLE MILK LONG LIFE S5OOML 1 PCS 24.00 PCS 67.40 1,617.58
|400833 | 6161108005039 | BIO YOG NATURE 450ML 1 PCS 12.00 PCS 266.37 3,196.45
|400700 | 6161108000027 | BIO YOG NATURE PLAIN 150ML 1 PCS 24.00 PCS 91.01 2,184.34
|400364 | 6161108000058 | BIO YOG S/BERRY 150ML CUP 1 PCS 36.00 PCS 91.02 3,276.54
|400365 | 6161108000119 | BIO YOG VANILLA 15OML CUP 1 PCS 48.00 PCS 91.02 4,368.72
|400839 | 6161105384663 | DAIMA YOG SLURRP PASSION 1L 1 PCS 3.00 PCS 208.05 624.15
|400828 | 6161108005169 | BIO YOGHURT MANGO 90ML 1 PCS 12.00 PCS 60.96 731.46
|400823 | 6161108005015 | BIO YOGHURT NATURE PLAIN 90ML 1 PCS 24.00 PCS 60.96 1,462.92
|400825 | 6161108005190 | BIO YOGHURT PEACH 90ML 1 PCS 12.00 PCS 60.96 731.46
|400826 | 6161108005046 | BIO YOGHURT STRAWBERRY 90ML 1 PCS 12.00 PCS 60.96 731.46
|400827 | 6161108005121 | BIO YOGHURT VANILLA 450ML 1 PCS 6.00 PCS 266.37 1,598.23
|400822 | 6161108005107 | BIO YOGHURT VANILLA 90ML 1 PCS 12.00 PCS 60.96 731.46
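The workflow described earlier (keep the lines with a 13-digit barcode, drop the "|" separators, then take field [1] as the barcode and field [-1] as the total) could look roughly like this. It is a sketch under a few assumptions of mine: the function names are invented, the barcode is always the second field, and the total always parses as a number once the thousands separators are removed. The two sample rows are copied from the output above.

```python
import csv
import io
import re

# Two sample rows copied from the Tesseract output above
SAMPLE = """\
|401078 | 6161108006029 | BIO WHOLE FRESH MILK 2L 1 PCS 60.00 PCS 251.10 15,066.00
|400242 | 6161108006012 | BIO WHOLE MILK 1LTR 1 PCS 24.00 PCS 130.50 3,132.00
"""

BARCODE_RE = re.compile(r"^\d{13}$")  # assumption: barcodes are exactly 13 digits


def split_row(line):
    """Split a line into fields, dropping the '|' separators.

    str.split() with no argument already discards empty strings.
    """
    tokens = line.replace("|", " | ").split()
    return [tok for tok in tokens if tok != "|"]


def parse_rows(text):
    """Return (barcode, total) pairs for lines whose second field is a barcode."""
    rows = []
    for line in text.splitlines():
        fields = split_row(line)
        if len(fields) > 2 and BARCODE_RE.match(fields[1]):
            total = float(fields[-1].replace(",", ""))  # "15,066.00" -> 15066.0
            rows.append((fields[1], total))
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["barcode", "total"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_rows(SAMPLE)
csv_text = to_csv(rows)
```

Lines where OCR mangled the barcode or the total would be skipped or raise a ValueError here, so a real run probably wants to log those lines for manual review. Writing `csv_text` to a file, or inserting `rows` with the sqlite3 module, covers the last step of the workflow.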