Python: Converting PDF to text; "Text extraction is not allowed"

I am trying to convert a PDF to text in Python, but it gives me this error:

PDFTextExtractionNotAllowed: Text extraction is not allowed:

The code I am using is:

import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io    


def pdfparser(data):
    fp = open(data, 'rb')      
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        data = retstr.getvalue()

    return data


if __name__ == '__main__':
    text = pdfparser(Input_path)

Can anyone help me?

The file path is:


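The error occurs because the line `data = retstr.getvalue()` is indented incorrectly; it should be outside the for loop.

However, after fixing that I ran into some other problems, so here is the complete code: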

import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io    


def pdfparser(data):
    fp = open(data, 'rb')      
    rsrcmgr = PDFResourceManager()
    # retstr = io.StringIO() #This will cause -- `TypeError: unicode argument expected, got 'str'`
    retstr = io.BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)

    data = retstr.getvalue()  # Indentation was wrong here
    fp.close()
    #print(data)
    return data


if __name__ == '__main__':
    # The PDF file you provided is encrypted with a blank password, so decrypt it first
    path = sys.argv[1]
    from subprocess import call
    import os
    pdf_filename = os.path.basename(path)
    file_name, extension = os.path.splitext(pdf_filename)
    # The decrypted copy is written to the current working directory
    pdf_filename_decr = str(file_name) + "_decr" + extension
    # Requires the qpdf command-line tool; an argument list avoids shell quoting issues
    call(['qpdf', '--password=', '--decrypt', path, pdf_filename_decr])

    text = pdfparser(pdf_filename_decr)
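
One of the comments further down reports a FileNotFound error at this step. That may happen when qpdf is not installed, or when the decrypted copy ends up somewhere other than expected, since the output name above is relative to the current working directory. Below is a minimal sketch of a more defensive variant, assuming the qpdf command-line tool is available; the helper names `decrypted_path` and `decrypt_pdf` are hypothetical and not part of any library.

```python
import os
import shutil
import subprocess


def decrypted_path(src, out_dir=None):
    """Build the output path for the decrypted copy (hypothetical helper)."""
    if out_dir is None:
        out_dir = os.getcwd()
    stem, ext = os.path.splitext(os.path.basename(src))
    return os.path.join(out_dir, stem + "_decr" + ext)


def decrypt_pdf(src, password="", out_dir=None):
    """Decrypt src with qpdf and return the path of the decrypted copy."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf was not found on PATH")
    dst = decrypted_path(src, out_dir)
    # An argument list avoids shell quoting problems with spaces in paths.
    result = subprocess.run(
        ["qpdf", "--password=" + password, "--decrypt", src, dst]
    )
    # Assumption: qpdf uses exit code 3 for "succeeded with warnings".
    if result.returncode not in (0, 3):
        raise RuntimeError("qpdf failed with exit code %d" % result.returncode)
    return dst
```

With this, `text = pdfparser(decrypt_pdf(sys.argv[1]))` would replace the inline qpdf call, and the returned path shows exactly where the decrypted file was written.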

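The problem is that `PDFPage.get_pages()` checks by default whether text extraction is allowed for the document. You have to set the flag `check_extractable=False` to make it work. Also, if the PDF you are trying to convert to txt is password-protected, you can pass the password there as well. Unfortunately, the documentation for `PDFPage` is not very clear about this.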

password = ""
for page in PDFPage.get_pages(fp, check_extractable=False, password=password):
    interpreter.process_page(page)
data = retstr.getvalue()

Your complete code would then look like this:

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdfparser(data):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    with open(data, 'rb') as fp:
        for page in PDFPage.get_pages(fp,
                                      pagenos, 
                                      maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=False):
            interpreter.process_page(page)

    # As pointed out in another answer, this goes outside the loop
    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text
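
For context, the check that `check_extractable` controls reads a permission flag stored in the document. Here is a minimal sketch of inspecting that flag directly, assuming your pdfminer version exposes `PDFParser`, `PDFDocument` and its `is_extractable` attribute as used below. `extraction_allowed` is a hypothetical helper name, and the imports sit inside the function only so that the snippet can be loaded without pdfminer installed.

```python
def extraction_allowed(path, password=""):
    """Return True if the PDF's permission flags allow text extraction."""
    # Imported here so this sketch can be loaded without pdfminer installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as fp:
        parser = PDFParser(fp)
        # Constructing the document also applies the password, if any.
        doc = PDFDocument(parser, password=password)
        return doc.is_extractable
```

When this returns False and the check is enabled, `get_pages()` raises `PDFTextExtractionNotAllowed`, which is the error from the question.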

Note: Python's `with open(...):` pattern is very useful for handling file objects correctly.
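
To illustrate the note with standard-library objects only: a `with` block closes the file even if an exception is raised, and `io.StringIO` supports the same pattern. This is a small sketch; `read_header` is a hypothetical name and the temporary file is only a stand-in for a real PDF.

```python
import io
import os
import tempfile


def read_header(path, size=4):
    """Return the first bytes of a file; the file is closed on exit."""
    with open(path, "rb") as fp:
        return fp.read(size)


# Write a tiny stand-in file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")

header = read_header(tmp.name)
os.remove(tmp.name)

# io.StringIO is a context manager too, so the buffer is closed automatically.
with io.StringIO() as buf:
    buf.write("text")
    value = buf.getvalue()
```

Read the buffer with `getvalue()` before the block ends, because the value is no longer available once the buffer is closed.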

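Be careful, you did not close fp.

Thanks, but it still does not work. It may be the same as this issue, where the PDF is marked as not allowing extraction:

This indentation was introduced by someone who edited the code. There was no indentation before!

Thanks for your reply. I would like to ask where pdf_filename_decr will be saved. When I run this program, it gives a FileNotFound error.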