Extracting text from a PDF file with Python 2.7 on Windows 7

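I have been using PyPDF2 with Python 2.7 to extract the text contained in PDF files generated with pdfTeX-1.40.0. That works fine, but now I also have to extract text from PDFs generated by LibreOffice 4.3, and for those files the extracted text comes out wrong.

Here is my code: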

    import PyPDF2

    pdfFileObj = open(filePath, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageText = ""
    for pageID in range(0, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageID)
        # extractText() returns unicode; encode it to a UTF-8 byte string (Python 2)
        pageText = pageText + "\n" + pageObj.extractText().encode('utf-8')
    extInfo = pageText
    pdfFileObj.close()

    # Search for the target string with its spaces removed
    stringPresent = string2search.replace(' ', '') in extInfo

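Is there a simple, working solution for Windows machines? I found this topic, but it has no solution. I also tried PDFMiner, as suggested in that topic, but I get the following error: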

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
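
This kind of error appears when a unicode string containing a non-ASCII character such as é (u'\xe9') is encoded with the ASCII codec, which Python 2 does implicitly in places such as str() or when writing to a byte stream. The sketch below reproduces the failure explicitly and shows an explicit UTF-8 encode as one possible way around it. `to_utf8_bytes` is a hypothetical helper name, not part of PDFMiner or PyPDF2.

```python
# -*- coding: utf-8 -*-

def to_utf8_bytes(text):
    """Encode a unicode string as UTF-8 bytes (hypothetical helper)."""
    return text.encode('utf-8')


sample = u'\xe9'  # the character named in the error message

# Encoding with the ASCII codec fails for any character above 127.
try:
    sample.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with UTF-8 succeeds and yields a two-byte sequence.
encoded = to_utf8_bytes(sample)
```

Encoding explicitly at the point where text leaves the program (file write, console, subprocess) avoids relying on the default ASCII codec.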

I believe your problem lies in the encoding used when the file is opened, before it is read:

    pdfFileObj = open(filePath, 'rb', encoding="utf-8")
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageText = ""
    for pageID in range(0, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageID)
        pageText = pageText + "\n" + pageObj.extractText().encode('utf-8')
    extInfo = pageText
    pdfFileObj.close()

    stringPresent = string2search.replace(' ', '') in extInfo

I finally found a solution:

1. Download the Xpdf tools for Windows.

2. Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32, and also to C:\Windows\SysWOW64.

3. Use this code:

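    import subprocess

    try:
        # "-" as the output file makes pdftotext write the text to stdout;
        # passing a list (no shell) keeps paths with spaces intact
        extInfo = subprocess.check_output(['pdftotext.exe', filePath, '-'], stderr=subprocess.STDOUT).strip()
    except Exception as e:
        print(e)
        extInfo = ''  # nothing extracted, so the search below finds nothing

    stringPresent = string2search in extInfo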
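
The pdftotext call can also be wrapped in a small function that passes the arguments as a list, so a path containing spaces needs no quoting, and that decodes the output explicitly. This is only a sketch under assumptions: the helper names `build_pdftotext_command` and `pdf_contains` are made up for illustration, and `-enc UTF-8` assumes the installed pdftotext build accepts that encoding name.

```python
import subprocess


def build_pdftotext_command(pdf_path, exe='pdftotext.exe'):
    """Return the argument list for pdftotext (hypothetical helper).

    '-' as the output file makes pdftotext write the text to stdout.
    """
    return [exe, '-enc', 'UTF-8', pdf_path, '-']


def pdf_contains(pdf_path, needle, exe='pdftotext.exe'):
    """Return True if `needle` (a unicode string) occurs in the PDF's text."""
    try:
        raw = subprocess.check_output(build_pdftotext_command(pdf_path, exe))
    except (OSError, subprocess.CalledProcessError) as e:
        # pdftotext is missing, or it could not read the file
        print(e)
        return False
    # Decode the bytes from pdftotext before comparing with unicode text
    return needle in raw.decode('utf-8', 'replace')
```

A call such as `pdf_contains(filePath, u'some text')` then replaces the try/except and the membership test, and it returns False rather than raising when pdftotext cannot be run.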

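I tried the encoding="utf-8" suggestion, but got the error: "TypeError: 'encoding' is an invalid keyword argument for this function"

Try 'r' instead of 'rb'.

With 'r' I get the error: "PyPDF2.utils.PdfReadError: EOF marker not found"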
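
Both errors are consistent with how Python 2 handles files: the built-in open() takes no encoding argument, and a PDF is binary data, so it should be opened with 'rb'. Opening it in text mode ('r') on Windows can alter or cut short the bytes that are read, which is one plausible reason why the %%EOF marker is no longer found. The sketch below checks a file for the two markers a PDF reader looks for. `looks_like_pdf` is a hypothetical helper, and the demo file is a stand-in that only carries the markers, not a valid PDF.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Check for the '%PDF-' header and a '%%EOF' marker near the end.

    Hypothetical helper; the file is read in binary mode ('rb').
    """
    with open(path, 'rb') as f:
        data = f.read()
    return data.startswith(b'%PDF-') and b'%%EOF' in data[-1024:]


# Demo on a tiny stand-in file that only contains the two markers
fd, demo_path = tempfile.mkstemp(suffix='.pdf')
with os.fdopen(fd, 'wb') as f:
    f.write(b'%PDF-1.4\n...\n%%EOF\n')
ok = looks_like_pdf(demo_path)
os.remove(demo_path)
```

If this check passes for a file that PyPDF2 rejects, the problem is more likely in how the file object was opened than in the file itself.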
