Python: why isn't pytesseract working?
I think pytesseract is not working, or something else is wrong. I have already installed Tesseract.
pip install pytesseract
Requirement already satisfied: pytesseract in c:\programdata\anaconda3\lib\site-packages (0.3.6)
Requirement already satisfied: Pillow in c:\programdata\anaconda3\lib\site-packages (from pytesseract) (7.0.0)
Note: you may need to restart the kernel to use updated packages.
System specs: Windows 10, Python 3.7, anaconda 1.9.12
# Import libraries
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
# Path of the pdf
PDF_file = "E:/python_test/test.pdf"
'''
Part #1 : Converting PDF to images
'''
# Store all the pages of the PDF in a variable
pages = convert_from_path(PDF_file, 500)
# Counter to store images of each page of PDF to image
image_counter = 1
# Iterate through all the pages stored above
for page in pages:
    # Declaring filename for each page of PDF as JPG
    # For each page, filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_"+str(image_counter)+".jpg"
    # Save the image of the page in system
    page.save(filename, 'JPEG')
    # Increment the counter to update filename
    image_counter = image_counter + 1
'''
Part #2 - Recognizing text from the images using OCR
'''
# Variable to get count of total number of pages
filelimit = image_counter-1
# Creating a text file to write the output
outfile = "out_text.txt"
# Open the file in append mode so that
# All contents of all images are added to the same file
f = open(outfile, "a")
# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):
    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_"+str(i)+".jpg"
    # Recognize the text as string in image using pytesseract
    text = pytesseract.image_to_string(Image.open(filename))
    # The recognized text is stored in variable text
    # Any string processing may be applied on text
    # Here, basic formatting has been done:
    # In many PDFs, at line ending, if a word can't
    # be written fully, a 'hyphen' is added.
    # The rest of the word is written in the next line
    # Eg: This is a sample text this word here GeeksF-
    # orGeeks is half on first line, remaining on next.
    # To remove this, we replace every '-\n' to ''.
    text = text.replace('-\n', '')
    # Finally, write the processed text to the file.
    f.write(text)
# Close the file after writing all the text.
f.close()
The error is: TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.
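For context: pytesseract is only a Python wrapper. It launches the separate `tesseract` executable, so `pip install pytesseract` succeeding says nothing about whether that executable can be found. A minimal sketch for checking this from Python, using only the standard library, might look like the following. The helper name `tesseract_status` is made up for this illustration and is not part of pytesseract.

```python
import shutil
import subprocess


def tesseract_status(executable="tesseract"):
    """Return (path, first_version_line) for the binary, or (None, None) if it is not on PATH.

    `tesseract_status` is a hypothetical helper, not a pytesseract function.
    """
    # shutil.which searches the same PATH that a subprocess launch would use.
    path = shutil.which(executable)
    if path is None:
        return None, None
    # Some Tesseract builds print the version to stderr, so check both streams.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    output = (result.stdout or result.stderr).strip()
    lines = output.splitlines()
    return path, (lines[0] if lines else "")


if __name__ == "__main__":
    path, version = tesseract_status()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print("found:", path)
        print("version:", version)
```

If this reports that nothing was found, the Python code in the question is probably fine and the problem is the missing or unreachable Tesseract installation.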
Download and install Tesseract OCR itself (the OCR engine, not only the pytesseract package).
Have you added it to your PATH?
No. How do I add it to the PATH?
Add the tesseract folder to your PATH. (You need to install Tesseract first.)
This post helped me.
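As a rough sketch of what "adding it to the PATH" can look like from inside Python: the first helper below appends a folder to PATH for the current process only, and the second sets `pytesseract.pytesseract.tesseract_cmd`, which is the approach described in the pytesseract README. The folder `C:\Program Files\Tesseract-OCR` is an assumption (a common default for Windows installers), and the helper names are made up for this example; use whatever location Tesseract was actually installed to.

```python
import os
import shutil

# ASSUMPTION: a common default install folder on Windows; adjust to your machine.
ASSUMED_TESSERACT_DIR = r"C:\Program Files\Tesseract-OCR"


def add_to_process_path(folder):
    """Append `folder` to PATH for this Python process only (hypothetical helper).

    This does not change the system-wide PATH; it lasts until the process exits.
    """
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if folder not in parts:
        parts.append(folder)
        os.environ["PATH"] = os.pathsep.join(parts)
    return os.environ["PATH"]


def point_pytesseract_at(exe_path):
    """Tell pytesseract the full path of the tesseract executable (hypothetical helper)."""
    import pytesseract  # imported lazily so the PATH helper works without it
    pytesseract.pytesseract.tesseract_cmd = exe_path


if __name__ == "__main__":
    add_to_process_path(ASSUMED_TESSERACT_DIR)
    # Prints None if tesseract is still not reachable from this process.
    print("tesseract resolved to:", shutil.which("tesseract"))
```

Setting `tesseract_cmd` to the full path of `tesseract.exe` before calling `image_to_string` is usually the simpler option in a notebook. Editing the system PATH in the Windows environment variable settings makes the change permanent, but the kernel or terminal needs a restart to see it.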