Python: converting PDF to text; "Text extraction is not allowed"
Tags: python, python-3.x, pdfminer

I am trying to convert a PDF to text with Python, but it gives me this error: PDFTextExtractionNotAllowed: Text extraction is not allowed. The code I am using is:
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io

def pdfparser(data):
    fp = open(data, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        data = retstr.getvalue()
    return data

if __name__ == '__main__':
    text = pdfparser(Input_path)
Can anyone help me?
The file path is:
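The error occurs because of wrong indentation on the line data = retstr.getvalue(); it should be outside the for loop.
However, after fixing that I ran into some other problems, so here is the complete code: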
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
import io

def pdfparser(data):
    fp = open(data, 'rb')
    rsrcmgr = PDFResourceManager()
    # retstr = io.StringIO()  # This will cause -- `TypeError: unicode argument expected, got 'str'`
    retstr = io.BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    data = retstr.getvalue()  # Indentation was wrong here
    fp.close()
    # print(data)
    return data

if __name__ == '__main__':
    # The PDF file you provided is encrypted with a blank password, so we need to decrypt it first
    path = sys.argv[1]
    from subprocess import call
    import os
    pdf_filename = os.path.basename(path)
    file_name, extension = os.path.splitext(pdf_filename)
    pdf_filename_decr = str(file_name) + "_decr" + extension
    call('qpdf --password=%s --decrypt %s %s' % ('', path, pdf_filename_decr), shell=True)
    text = pdfparser(pdf_filename_decr)
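
Because the version above collects its output in io.BytesIO, pdfparser returns bytes rather than str on Python 3. The following is a minimal sketch of normalizing that result, not part of the original answer; buffer_to_text is a hypothetical helper name and is not provided by pdfminer.

```python
import io


def buffer_to_text(buffer, codec="utf-8"):
    """Return the buffer contents as str, decoding when the buffer holds bytes."""
    value = buffer.getvalue()
    if isinstance(value, bytes):
        # BytesIO.getvalue() returns bytes; decode with the codec used for writing.
        return value.decode(codec)
    # StringIO.getvalue() already returns str.
    return value


# Stand-ins for the buffer a converter would have written to.
binary_buffer = io.BytesIO()
binary_buffer.write("café".encode("utf-8"))
text_buffer = io.StringIO()
text_buffer.write("café")

# Both buffers yield the same str once normalized.
print(buffer_to_text(binary_buffer) == buffer_to_text(text_buffer))
```

In pdfparser, this would mean returning buffer_to_text(retstr, codec) instead of retstr.getvalue(), assuming the caller wants str.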
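The problem is that PDFPage.get_pages() checks whether the document permits text extraction. You have to pass check_extractable=False for it to work. In addition, if the PDF you are converting to text is password-protected, you can supply the password in the same call. Unfortunately, the documentation for PDFPage is not very clear about this.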
password = ""
for page in PDFPage.get_pages(fp, check_extractable=False, password=password):
    interpreter.process_page(page)
data = retstr.getvalue()
Your entire code would then look like this:
import io
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdfparser(data):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    with open(data, 'rb') as fp:
        for page in PDFPage.get_pages(fp,
                                      pagenos,
                                      maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=False):
            interpreter.process_page(page)
    # As pointed out in another answer, this goes outside the loop
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text
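Note: Python's with open(...): pattern is very useful for handling file objects properly. Be careful: your code never closed fp.

Thanks, but it still doesn't work.

Possibly the same as this issue, where the PDF is marked as not allowing extraction:

This indentation was introduced by someone who edited the code. There was no such indentation before!

Thanks for the reply. Where will pdf_filename_decr be saved? When I run this program, it raises a FileNotFound error.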
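
On where pdf_filename_decr ends up: the answer's code builds that name from os.path.basename(path) only, so qpdf writes the decrypted copy relative to the current working directory, not next to the input file. If qpdf is not installed or fails, call(..., shell=True) does not raise, and the later open() then fails because the file was never created. The sketch below is one possible way to make both cases explicit; build_qpdf_command and decrypt_with_qpdf are hypothetical helper names, and writing the output next to the input is an assumption about what is wanted. The qpdf flags are the ones used in the answer.

```python
import os
import shutil
import subprocess


def build_qpdf_command(path, password=""):
    """Return (command, output_path) for decrypting `path` with qpdf."""
    directory, pdf_filename = os.path.split(os.path.abspath(path))
    file_name, extension = os.path.splitext(pdf_filename)
    # Place the decrypted copy next to the input instead of the working directory.
    output_path = os.path.join(directory, file_name + "_decr" + extension)
    command = ["qpdf", "--password=" + password, "--decrypt", path, output_path]
    return command, output_path


def decrypt_with_qpdf(path, password=""):
    """Run qpdf and return the path of the decrypted file."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf was not found on PATH")
    command, output_path = build_qpdf_command(path, password)
    # An argument list avoids shell quoting problems with spaces in paths;
    # check=True raises CalledProcessError on any non-zero exit status.
    subprocess.run(command, check=True)
    return output_path
```

With this, the main block would call pdfparser(decrypt_with_qpdf(sys.argv[1])), and a missing qpdf binary shows up as a clear error rather than a later FileNotFound.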