Python PyPDF hangs on a large drawing
My code (below) hangs on the following line:
content += " ".join(extract.strip().split())
It hangs on page 21, which is a large drawing. I don't mind skipping pages like this one, but I don't know how to do that. Can anyone help?
def ConvertPDFToText(self, pPDF):
    content = ""
    # Load PDF into pyPDF
    remoteFile = urlopen(pPDF).read()
    memoryFile = StringIO(remoteFile)
    pdf = PdfFileReader(memoryFile)
    print("Done reading")
    # Iterate pages
    try:
        numPages = pdf.getNumPages()
        print(str(numPages) + " pages detected")
        for i in range(0, numPages):
            # Extract text from page and add to content
            page = pdf.getPage(i)
            extract = page.extractText() + "\n"
            content += " ".join(extract.strip().split())
    except UnicodeDecodeError as ex:
        print(self._Name + " - Unicode Error extracting pages: " + str(ex))
        return ""
    except Exception as ex:
        print(self._Name + " - Generic Error extracting pages - " + str(ex))
        return ""
    # Decode the content. Since we don't know the encoding, we iterate through some possibilities.
    encodings = ['utf8', 'windows-1250', 'windows-1252', 'utf16', 'utf32']
    DecodedContent = ""
    for code in encodings:
        try:
            DecodedContent = content.decode(code)
            break
        except Exception as ex:
            continue
    return DecodedContent
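One way to skip a page that takes too long is to put a time limit around each page's extraction. The following is only a sketch under stated assumptions, not a tested fix for this PDF: `time_limit`, `PageTimeout`, and `extract_with_limit` are hypothetical names, and `extract_page` stands for any callable that returns the text of page `i`. It relies on `signal.setitimer`, so it works only on Unix and only in the main thread, and it can interrupt only code that returns control to the Python interpreter.

```python
import signal
from contextlib import contextmanager


class PageTimeout(Exception):
    """Raised when a single page takes longer than the allowed time."""


@contextmanager
def time_limit(seconds):
    """Abort the enclosed block after `seconds` (Unix, main thread only)."""
    def _on_alarm(signum, frame):
        raise PageTimeout("gave up after %s seconds" % seconds)

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def extract_with_limit(extract_page, num_pages, seconds=10):
    """Collect text page by page, skipping pages that exceed the limit.

    `extract_page` is an assumed callable: page index -> page text, e.g.
    lambda i: pdf.getPage(i).extractText()
    """
    parts, skipped = [], []
    for i in range(num_pages):
        try:
            with time_limit(seconds):
                parts.append(" ".join(extract_page(i).split()))
        except PageTimeout:
            skipped.append(i)  # remember which pages were dropped
    return " ".join(parts), skipped
```

The function returns the collected text together with the indices of the skipped pages, so the caller can report which pages were left out.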
Rather than using pyPdf, which has not been updated since 2010, use PyPDF2, a newer fork of pyPdf. You can find it here:
from PyPDF2 import PdfFileReader

#----------------------------------------------------------------------
def parse_pdf(pdf_file):
    """Extract the text of every page and return it as one string."""
    content = ""
    pdf = PdfFileReader(open(pdf_file, 'rb'))
    numPages = pdf.getNumPages()
    for i in range(0, numPages):
        # Extract text from page and add to content
        page = pdf.getPage(i)
        extract = page.extractText() + "\n"
        content += " ".join(extract.strip().split())
    return content

if __name__ == "__main__":
    pdf = "Kicking Horse Mountain Park Construction 2014.pdf"
    parse_pdf(pdf)
On page 20 I got
pyPdf.utils.PdfReadError: Unexpected escaped string
. After running the code above for about 30 minutes, I got a segmentation fault.
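For a page that raises an error such as the `PdfReadError` above, moving the `try`/`except` inside the loop lets the remaining pages still be processed. This is a minimal sketch, assuming a hypothetical helper `extract_skipping_errors` that receives the per-page extraction as a callable. The commented usage lines use only the PyPDF2 calls that already appear in the thread.

```python
def extract_skipping_errors(extract_page, num_pages):
    """Return (text, failed), where failed maps page index -> error message.

    `extract_page` is an assumed callable: page index -> page text.
    """
    parts = []
    failed = {}
    for i in range(num_pages):
        try:
            text = extract_page(i)
        except Exception as ex:  # e.g. PdfReadError on a malformed page
            failed[i] = str(ex)
            continue  # skip this page, keep going with the next one
        parts.append(" ".join(text.split()))
    return "\n".join(parts), failed


# Possible usage with the reader from the answer (not run here):
# pdf = PdfFileReader(open(pdf_file, 'rb'))
# text, failed = extract_skipping_errors(
#     lambda i: pdf.getPage(i).extractText(), pdf.getNumPages())
```

This does not help with the segmentation fault: a crash of the interpreter itself cannot be caught by an `except` clause, so handling it would need each page to be processed in a separate process.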