Python 3.x: Extract text from a PDF with PyPDF2


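I am following an online tutorial to extract text from a PDF.

I can print the PDF's document info, but I cannot print the content of a page. No error is thrown, yet no text from the PDF shows up either.

What is wrong?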

from PyPDF2 import PdfFileReader


def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    #print(info)

    author = info.author
    creator = info.creator
    producer = info.producer
    subject = info.subject
    title = info.title


    print(author)
    print(creator)
    print(producer)
    print(subject)
    print(title)

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # get the first page
        page = pdf.getPage(0)
        print(page)
        print('Page type: {}'.format(str(type(page))))

        text = page.extractText()

        print(text)  # this should print the text from the PDF, but nothing shows up



if __name__ == '__main__':
    # PDF URL: https://oficinavirtual.ugr.es/apli/solicitudPAU/test.pdf
    path = 'test.pdf'
    get_info(path)
    print("\n"*2)
    text_extractor(path)

This is not a fix for PyPDF2 itself, but you can simply install pdfminer3 with pip and work from its minimal reproducible example.

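Your code runs fine for me, so the problem may come from the file you are using (maybe the first page is empty?). Also, I have used PyPDF2 for a while and it is not foolproof: PDFs encoded with old versions of Adobe, or converted from odd formats, may fail to parse, throw exceptions, or simply return gibberish. So test your code against several different files, and wrap the extraction in try/except. Does this answer your question?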
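As a rough sketch of that advice, the snippet below checks every page instead of only the first one and wraps each extraction in try/except. It keeps the same PyPDF2 calls as the question (`PdfFileReader`, `getNumPages`, `getPage`, `extractText`); newer releases of the library rename these, so adjust to whatever version you have installed. The helper names `extract_pages` and `report` are made up for this illustration and are not part of PyPDF2.

```python
def extract_pages(reader):
    """Return a list of (page_index, text_or_None, error_or_None) tuples.

    `reader` is any object exposing getNumPages() and getPage(i),
    such as the PdfFileReader used in the question.
    """
    results = []
    for index in range(reader.getNumPages()):
        try:
            text = reader.getPage(index).extractText()
        except Exception as exc:  # odd files can fail in many different ways
            results.append((index, None, exc))
            continue
        results.append((index, text, None))
    return results


def report(path):
    """Print a one-line extraction summary for every page of the PDF at `path`."""
    # Imported here so extract_pages() can be tried out without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(path, 'rb') as f:
        for index, text, error in extract_pages(PdfFileReader(f)):
            if error is not None:
                print('page {}: extraction failed: {!r}'.format(index, error))
            elif not text.strip():
                print('page {}: no text found (page may be empty)'.format(index))
            else:
                print('page {}: {} characters extracted'.format(index, len(text)))
```

Calling `report('test.pdf')` on a few different files should show whether the empty output is specific to one page, to one file, or happens everywhere.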