Python PyPDF hangs on a large image

python parsing


My code (below) hangs on the following line:

 content += " ".join(extract.strip().split())

It hangs on page 21, which is one large image. I don't mind skipping pages like this one, but I don't know how to do that. Can anyone help?

 def ConvertPDFToText(self, pPDF):
    content = ""
    # Load PDF into pyPDF
    remoteFile = urlopen(pPDF).read()
    memoryFile = StringIO(remoteFile)
    pdf = PdfFileReader(memoryFile)
    print("Done reading")
    # Iterate pages
    try:
        numPages = pdf.getNumPages()
        print(str(numPages) + " pages detected")
        for i in range(0, numPages):
            # Extract text from page and add to content
            page = pdf.getPage(i)
            extract = page.extractText() + "\n"
            content += " ".join(extract.strip().split())

    except UnicodeDecodeError as ex:
        print(self._Name + " - Unicode Error extracting pages: " + str(ex))
        return ""
    except Exception as ex:
        print(self._Name + " - Generic Error extracting pages - " + str(ex))
        return ""
    # Decode the content. Since we don't know the encoding, we iterate through some possibilities.

    encodings = ['utf8', 'windows-1250', 'windows-1252', 'utf16', 'utf32']
    DecodedContent = ""
    for code in encodings:
        try:
            DecodedContent = content.decode(code)
            break
        except Exception as ex:
            continue
    return DecodedContent
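Since the goal is to skip pages that take too long, one option is to put a time limit on each page. The following is only a sketch and has not been run against the PDF in question. It assumes Python 3 on a Unix-like system, that it runs in the main thread, and that the slow work is pure-Python code (as pyPdf is), because the signal handler only runs between Python bytecodes. The names extract_with_timeout, PageTimeout and the extract_page callable are made up for this illustration, and slow_on_page_1 is a stand-in for a real extractor.

```python
import signal
import time


class PageTimeout(Exception):
    """Raised when a single page takes longer than the allowed time."""


def _raise_timeout(signum, frame):
    # Signal handler: turn SIGALRM into a Python exception.
    raise PageTimeout()


def extract_with_timeout(extract_page, page_count, seconds=30.0):
    """Call extract_page(i) for every page, skipping pages that exceed `seconds`.

    extract_page is a hypothetical callable supplied by the caller.
    Returns (texts, skipped), where skipped lists the page indexes given up on.
    """
    texts, skipped = [], []
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    try:
        for i in range(page_count):
            signal.setitimer(signal.ITIMER_REAL, seconds)  # arm the timer
            try:
                texts.append(extract_page(i))
            except PageTimeout:
                skipped.append(i)  # give up on this page and move on
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
    finally:
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
    return texts, skipped


def slow_on_page_1(i):
    # Stand-in extractor: page 1 simulates the page that never finishes.
    if i == 1:
        time.sleep(5)
    return "page %d" % i
```

With pyPdf, extract_page could be something like lambda i: pdf.getPage(i).extractText(), with the time limit chosen to suit the document.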

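Instead of using pyPdf, which has not been updated since 2010, use PyPDF2, a newer fork of pyPdf. You can find it here:

I just tried it on your sample PDF, and although it took a while to parse the file, it worked fine. Here is the code I used: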

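# Written for the PyPDF2 1.x/2.x API. PyPDF2 3.0 removed these names in
# favour of PdfReader, reader.pages and page.extract_text().
from PyPDF2 import PdfFileReader

#----------------------------------------------------------------------
def parse_pdf(pdf_file):
    """Extract the text of every page of pdf_file and return it as one string."""
    content = ""
    # Keep the file open while the pages are being read
    with open(pdf_file, 'rb') as f:
        pdf = PdfFileReader(f)
        numPages = pdf.getNumPages()
        for i in range(0, numPages):
            # Extract text from page and add to content
            page = pdf.getPage(i)
            extract = page.extractText()
            # Collapse runs of whitespace, then end each page with a newline
            content += " ".join(extract.split()) + "\n"
    return content

if __name__ == "__main__":
    pdf = "Kicking Horse Mountain Park Construction 2014.pdf"
    text = parse_pdf(pdf)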

On page 20 I get

pyPdf.utils.PdfReadError: Unexpected escaped string

After running the code above for about 30 minutes, I got a segmentation fault.
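An exception such as the PdfReadError above can be contained by moving the try/except inside the loop, so that one bad page does not discard the text of all the others. This is a minimal sketch in Python 3. The function extract_skipping_errors and the extract_page callable are hypothetical names, and fake_extract merely imitates a page that fails. Note that this only helps with Python exceptions: a segmentation fault kills the interpreter and cannot be caught this way.

```python
def extract_skipping_errors(extract_page, page_count):
    """Extract every page, recording failures instead of aborting.

    extract_page is a hypothetical callable that returns the text of page i.
    Returns (text, failures), where failures maps page index -> error message.
    """
    parts, failures = [], {}
    for i in range(page_count):
        try:
            raw = extract_page(i)
        except Exception as ex:  # e.g. a read error on a malformed page
            failures[i] = "%s: %s" % (type(ex).__name__, ex)
            continue
        # Collapse runs of whitespace within the page
        parts.append(" ".join(raw.split()))
    # One line of text per successfully extracted page
    return "\n".join(parts), failures


def fake_extract(i):
    # Stand-in extractor: page 1 imitates a page that cannot be parsed.
    if i == 1:
        raise ValueError("Unexpected escaped string")
    return "  text of\n page %d " % i
```

The failures dictionary makes it easy to report afterwards which pages were skipped and why.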