Convert a scanned PDF to text with Python

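I have a scanned PDF file and I am trying to extract the text from it. I tried running OCR on it with pypdfocr, but I get this error:

"Could not find Ghostscript in the usual place"

After searching, I found this solution and tried downloading Ghostscript and adding it to the environment variables, but the same error still occurs.

How can I search for text in a scanned PDF file using Python?

Thanks

Edit: here is a sample of my code: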

import os
import sys
import re
import json
import shutil
import glob
from pypdfocr import pypdfocr_gs
from pypdfocr import pypdfocr_tesseract 
from PIL import Image

path = PATH_TO_MY_SCANNED_PDF
mainL = []
kk = {}


def new_init(self, kk):
    self.lang = 'heb'   
    self.binary = "tesseract"
    self.msgs = {
            'TS_MISSING': """ 
                Could not execute %s
                Please make sure you have Tesseract installed correctly
                """ % self.binary,
            'TS_VERSION':'Tesseract version is too old',
            'TS_img_MISSING':'Cannot find specified tiff file',
            'TS_FAILED': 'Tesseract-OCR execution failed!',
        }

pypdfocr_tesseract.PyTesseract.__init__ = new_init  

wow = pypdfocr_gs.PyGs(kk)
tt = pypdfocr_tesseract.PyTesseract(kk)


def secFile(filename,oldfilename):
    wow.make_img_from_pdf(filename)


    files = glob.glob("X:/e206333106/ocr-114/balagan/" + '*.jpg')  
    for file in files:
        im = Image.open(file)
        im.save(file + ".tiff") 

    files = glob.glob("PATH" + '*.tiff')  
    for file in files:
        tt.make_hocr_from_pnm(file)
    pdftxt = ""    
    files = glob.glob("PATH" + '*.html') 
    for file in files:
        with open(file) as myfile:
            pdftxt = pdftxt + "#" + "".join(line.rstrip() for line in myfile)
    findNum(pdftxt,oldfilename)

    folder ="PATH"

    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
        except Exception as e:
            print(e)

def pdf2ocr(filename):
    pdffile = filename
    os.system('pypdfocr -l heb ' + pdffile)

def ocr2txt(filename):  
    pdffile = filename


    output1 = pdffile.replace(".pdf","_ocr.txt")
    output1 = "PATH" + os.path.basename(output1)

    input1 = pdffile.replace(".pdf","_ocr.pdf")

    os.system("pdf2txt -o " + output1 + " " + input1)

    with open(output1) as myfile:
        pdftxt="".join(line.rstrip() for line in myfile)
    findNum(pdftxt,filename)


def findNum(pdftxt,pdffile):
    l = re.findall(r'\b\d+\b', pdftxt)


    output = open('PATH' + os.path.basename(pdffile) + '.txt', 'w')
    for i in l:
        output.write(",")
        output.write(i)
    output.close()    

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

i = 0     
files = glob.glob(path + '\\*.pdf') 
print(path)
print(files)
for file in files:
    if file.endswith(".pdf"):
        if is_ascii(file):
            print(file)
            pdf2ocr(file)
            ocr2txt(file)
        else:
            newname = "PATH" + str(i) + ".pdf"
            shutil.copyfile(file, newname)
            print(newname)
            secFile(newname,file)
        i = i + 1

files = glob.glob(path + '\\' + '*_ocr.pdf')         

for file in files:
    print(file)
    shutil.copyfile(file, "PATH" + os.path.basename(file))
    os.remove(file)

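Take a look at this library:

But PDF files can also contain images. You can analyze the page content streams. Some scanners break a single scanned page up into images, so you cannot get the text out with Ghostscript.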
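Before reaching for OCR, it can help to check whether the PDF has any embedded text at all. The following is a minimal sketch, assuming the pypdf package is installed; has_text_layer and extract_page_texts are hypothetical helper names, and a page that yields no extractable text is only assumed to be a scanned image.

```python
def has_text_layer(page_texts, min_chars=1):
    """Return True if any page yielded at least min_chars non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(pdf_path):
    """Return the embedded text of every page (assumes pypdf is installed)."""
    # Imported lazily so has_text_layer can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Pages that are pure images typically produce an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


# Usage (hypothetical file name):
# texts = extract_page_texts("scan.pdf")
# if not has_text_layer(texts):
#     print("No text layer found; this PDF probably needs OCR")
```

If has_text_layer returns False for every page, plain text extraction will not help and an OCR pipeline is the way to go.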

You can use OpenCV for Python. There is a lot of information available about text detection.

Take a look at my code; it works for me:

import io

import pytesseract
from PIL import Image
from wand.image import Image as wi


def Get_text_from_image(pdf_path):
    # Rasterize every page of the PDF at 300 DPI and convert the pages to JPEG
    pdf = wi(filename=pdf_path, resolution=300)
    pdfImg = pdf.convert('jpeg')

    # Collect each page as an in-memory JPEG blob
    imgBlobs = []
    for img in pdfImg.sequence:
        page = wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))

    # Run Tesseract on each page image and collect the recognized text
    extracted_text = []
    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang='eng')
        extracted_text.append(text)

    return extracted_text

I fixed it by editing /etc/ImageMagick-6/policy.xml and changing the rights on the PDF line to "read|write":

Open a terminal and change to that directory:

cd /etc/ImageMagick-6
nano policy.xml
<policy domain="coder" rights="read" pattern="PDF" /> 
change to
<policy domain="coder" rights="read|write" pattern="PDF" />
exit
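To confirm whether the policy file is what blocks PDF input, the coder rights can be read with the standard library. This is only a sketch: pdf_coder_rights and pdf_is_readable are hypothetical helpers, and the pattern handling is a simplification (it only understands a bare name such as PDF or a brace list such as {PS,PDF}).

```python
import xml.etree.ElementTree as ET


def pdf_coder_rights(policy_xml):
    """Return the rights strings of every coder policy whose pattern names PDF."""
    root = ET.fromstring(policy_xml)
    rights = []
    for policy in root.iter("policy"):
        if policy.get("domain") != "coder":
            continue
        # A pattern may list several coders, e.g. "{PS,PDF,XPS}".
        names = policy.get("pattern", "").strip("{}").split(",")
        if "PDF" in (name.strip().upper() for name in names):
            rights.append(policy.get("rights", ""))
    return rights


def pdf_is_readable(policy_xml):
    """Return True if no matching policy withholds the 'read' right."""
    return all("read" in r.lower().split("|") for r in pdf_coder_rights(policy_xml))


# Usage (path taken from the answer above):
# with open("/etc/ImageMagick-6/policy.xml") as f:
#     print(pdf_is_readable(f.read()))
```

If pdf_is_readable returns False, wand will most likely fail when opening the PDF until the policy line is changed as described above.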

Convert the PDF, run OCR with pytesseract, and export each page of the PDF to a text file.

Install these:

conda install -c conda-forge pytesseract

conda install -c conda-forge tesseract

pip install pdf2image

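import glob

import pytesseract
from pdf2image import convert_from_path

pdfs = glob.glob(r"yourPath\*.pdf")

for pdf_path in pdfs:
    # Render each page of the PDF as an image at 500 DPI
    pages = convert_from_path(pdf_path, 500)

    for pageNum, imgBlob in enumerate(pages):
        text = pytesseract.image_to_string(imgBlob, lang='eng')

        # Write the text of each page to its own file next to the PDF
        with open(f'{pdf_path[:-4]}{pageNum}.txt', 'w') as the_file:
            the_file.write(text)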
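Since the original question is about searching the scanned PDF, the per-page OCR text can then be searched with ordinary string matching. The sketch below is one possible approach, not part of the answer above: normalize and find_pages are hypothetical helpers, and whitespace is collapsed because OCR output often breaks a phrase across lines.

```python
def normalize(text):
    """Collapse runs of whitespace and ignore case so line breaks do not split a phrase."""
    return " ".join(text.split()).casefold()


def find_pages(page_texts, query):
    """Return the 1-based numbers of the pages whose text contains query."""
    needle = normalize(query)
    return [number for number, text in enumerate(page_texts, start=1)
            if needle in normalize(text)]


# Usage with OCR output collected in a list, one string per page:
# page_texts = [pytesseract.image_to_string(page, lang='eng') for page in pages]
# print(find_pages(page_texts, "invoice number"))
```

Exact matching like this is sensitive to OCR misreads, so a fuzzy comparison may be needed for noisy scans.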

PyPDF2 is a Python library built as a PDF toolkit. It is capable of:

Extracting document information (title, author, …)
Splitting documents page by page
Merging documents page by page
Cropping pages
Merging multiple pages into a single page
Encrypting and decrypting PDF files
and more!

To install PyPDF2, run the following command from the command line:

pip install PyPDF2

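Code:

# Note: this is the legacy PyPDF2 API (PdfFileReader, numPages, getPage, extractText).
# PyPDF2 3.0 and later renamed these to PdfReader, len(reader.pages),
# reader.pages[i] and extract_text().
import PyPDF2

pdfFileObj = open('myPdf.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)            # number of pages in the document
pageObj = pdfReader.getPage(0)       # first page
print(pageObj.extractText())         # embedded text only; no OCR is performed
pdfFileObj.close()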

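Can you provide a sample of your code?

I edited it into my question. Still the same error. I ran pypdfocr filename.pdf on the command line and got: ERROR: Could not find Ghostscript in the usual place; please specify it using your config file

What operating system are you using?

I'm using Windows 64-bit.

Did you install ghostscript with pip?

pip install ghostscript

It may be looking for the 32-bit version of GS; try installing it.

I couldn't find how to use it for PDF files.

Print the PDF to an image (png or jpeg), then you can use OpenCV OCR.

But I don't think this works for OCR. It is great for PDFs that contain real text. Once you feed it scanned text (that is, images), it won't work.

Is there a way to extract the position, font, size, etc. of the text, so that you can create a PDF file containing the text?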
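One way to see which Ghostscript build, if any, the Python process can actually find is to query PATH from Python itself. This is a minimal sketch: find_ghostscript is a hypothetical helper, gswin64c and gswin32c are the usual names of the Windows console executables, and gs is the usual Unix name. It only inspects PATH; whether pypdfocr consults PATH or relies solely on its config file is not confirmed here.

```python
import shutil

# Executable names Ghostscript is commonly installed under:
# gswin64c / gswin32c are the Windows console builds, gs is the Unix name.
GS_CANDIDATES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(candidates=GS_CANDIDATES):
    """Return the full path of the first Ghostscript executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    gs_path = find_ghostscript()
    if gs_path is None:
        print("Ghostscript is not on PATH for this Python process")
    else:
        print("Found Ghostscript at", gs_path)
```

If this prints that nothing was found even after editing the environment variables, the terminal or IDE may need to be restarted so the new PATH is picked up.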
