Removing scan artifacts from an image, leaving only the text (OpenCV + Python)

I'm trying to write a Python script that will "clean up" scanned images before processing them with Tesseract. Besides the text, the images contain dust, scanning artifacts, stray lines along the page edges, and so on.

Here is what I have so far. It tries to remove some of the dust with cv2.connectedComponentsWithStats, removes horizontal and vertical lines with morphological structuring elements, and then attempts to crop the image down to the text. It's better than nothing, since it does remove some noise, but it sometimes also deletes actual text and leaves a few lines in the page margins:

logging.info('Opening image ' + path)
image = cv2.imread(path, 0)  # read directly as grayscale
logging.info('Binarizing (inverted)...')
_, blackAndWhite = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY_INV)
# Find and exclude small elements
logging.info('Removing small dotted regions (dust, etc.)...')
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(blackAndWhite, None, None, None, 8, cv2.CV_32S)
sizes = stats[1:, -1] #get CC_STAT_AREA component
img2 = np.zeros((labels.shape), np.uint8)
for i in range(0, nlabels - 1):
    if sizes[i] >= 40:   #filter small dotted regions
        img2[labels == i + 1] = 255
image = cv2.bitwise_not(img2)
logging.info('Writing the modified image...')
cv2.imwrite(out_filename, image)
# ------ START CROPPING ----- #
# Load image, grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread(out_filename)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (5,5), 0)
logging.info('Applying Otsu\'s threshold')
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25,4))
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,32))
detected_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
detected_vlines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)

for l in [detected_lines, detected_vlines]:
    cnts = cv2.findContours(l, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    for c in cnts:
        cv2.drawContours(thresh, [c], -1, (0,0,0), 50)
        cv2.drawContours(image, [c], -1, (255,255,255), 50)

# Create rectangular structuring element and dilate
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18,18))
logging.info('Dilating text regions')
dilate = cv2.dilate(thresh, kernel, iterations=4)

try:
    # Find contours and draw rectangle
    cnts, hierarchy = cv2.findContours(dilate, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    logging.info('Extracting contours')
    # Search for contours and append their coordinates into an array
    arr = []
    for c in cnts:
        x,y,w,h = cv2.boundingRect(c)
        # Exclude oddly shaped elements (likely leftover lines, not text)
        if w/h > 8 or h/w > 1.6:
            continue
        arr.append((x,y))
        arr.append((x+w,y+h))
    # Calculate the coordinates and crop the image
    logging.info('Cropping the image')
    x,y,w,h = cv2.boundingRect(np.asarray(arr))
    image = image[y:y+h,x:x+w]
    if debug:
        logging.info('Showing the image (press "q" to continue)')
        label = "STAGE FOUR: CROPPED IMAGE"
    logging.info('Writing to ' + out_filename)
except cv2.error:
    pass
cv2.imwrite(out_filename, image)

I'm fairly new to image processing and don't have much experience. I'd love to hear some suggestions on how to improve this algorithm.

I would start by calling pytesseract.image_to_data() on the whole image. That gives you the position and OCR confidence of every detected word (including the garbage characters at the page edges). Then determine the region containing valid text from the positions of the high-confidence words. Finally, run pytesseract.image_to_string() on that region to get the text (or filter the results you already have from pytesseract.image_to_data()).

This approach works for the given example. If you also want to remove the dust, you could look into "salt-and-pepper noise filtering", but that seems unnecessary here.

import cv2
import pandas as pd
import pytesseract
from io import StringIO

# Obtain OCR data
img_bgr = cv2.imread("XVePx.jpg")
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
ocr_data = pytesseract.image_to_data(img_rgb, lang="deu")
ocr_df = pd.read_table(StringIO(ocr_data), quoting=3)

# Determine the text region based on the words (2+ characters) of high confidence (>90%)
confident_words_df = ocr_df[
    (ocr_df["conf"] > 90)
    & (ocr_df["text"].str.len() - ocr_df["text"].str.count(" ") > 1)
]
top = confident_words_df["top"].min()
left = confident_words_df["left"].min()
bot = (confident_words_df["top"] + confident_words_df["height"]).max()
right = (confident_words_df["left"] + confident_words_df["width"]).max()

# Obtain OCR string
ocr_string = pytesseract.image_to_string(img_rgb[top:bot, left:right, :], lang="deu")
print(ocr_string)

Removing small blobs may also remove the dots of the letter "i". — Welcome to Stack Overflow! Questions like this are usually a good fit here. — Thank you very much!