Python 3.x: Read all bookmarks in a PDF document and create a dictionary with bookmark page numbers and titles


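I am trying to read a PDF document using Python and the PyPDF2 package. The goal is to read all the bookmarks in the PDF and build a dictionary with the bookmark page number as the key and the bookmark title as the value.

Apart from this post, there is not much help on the internet on how to achieve this. The code posted there does not work, and I am not enough of a Python expert to correct it. PyPDF2's reader object has a property named outlines that gives you a list of all bookmark objects, but the bookmarks carry no page numbers, and traversing the list is somewhat difficult because there appears to be no parent/child relationship between bookmarks.

Below is the code I use to read the PDF document and inspect the property: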

import PyPDF2

pdfObj = open('SomeDocument.pdf', 'rb')
readerObj = PyPDF2.PdfFileReader(pdfObj)
print(readerObj.numPages)
print(readerObj.outlines[1][1])

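The parent/child relationships are preserved by nesting the lists inside each other. This sample code displays the bookmarks recursively as an indented table of contents: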

import PyPDF2


def show_tree(bookmark_list, indent=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call with increased indentation
            show_tree(item, indent + 4)
        else:
            print(" " * indent + item.title)


reader = PyPDF2.PdfFileReader("[your filename]")

show_tree(reader.getOutlines())

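I don't know how to retrieve the page numbers. I tried a few files, and the page attribute of a Destination object was always an instance of IndirectObject, which does not seem to contain any information about the page number.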

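Update:

The reader has a getDestinationPageNumber method that returns the page number for a Destination object. Here is the code modified to build the desired dictionary: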

import PyPDF2


def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            result[reader.getDestinationPageNumber(item)] = item.title
    return result


reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))

Note, however, that dictionary keys must be unique, so if several bookmarks point to the same page, some values will be overwritten and lost.
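
One way around the overwriting problem is to map each page number to a list of titles instead of a single title. The following is only a minimal sketch and not part of PyPDF2: `Bookmark` is a hypothetical stand-in for PyPDF2's bookmark objects, and `page_of` is an assumed callable that plays the role of `reader.getDestinationPageNumber`.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a PyPDF2 bookmark object (assumption for this sketch).
Bookmark = namedtuple("Bookmark", ["title", "page"])


def bookmarks_by_page(bookmark_list, page_of):
    """Map page number -> list of titles, descending into nested lists."""
    result = defaultdict(list)
    for item in bookmark_list:
        if isinstance(item, list):
            # A nested list holds child bookmarks; merge their titles page by page.
            for page, titles in bookmarks_by_page(item, page_of).items():
                result[page].extend(titles)
        else:
            result[page_of(item)].append(item.title)
    return dict(result)


# Fake outline in which two bookmarks share page 0.
outline = [
    Bookmark("Chapter 1", 0),
    [Bookmark("Section 1.1", 0), Bookmark("Section 1.2", 3)],
    Bookmark("Chapter 2", 5),
]
pages = bookmarks_by_page(outline, page_of=lambda b: b.page)
print(pages)
```

With a real file, the call would presumably look like `bookmarks_by_page(reader.getOutlines(), reader.getDestinationPageNumber)`.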


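PyPDF2 is no longer maintained. Here is how to do it with PyMuPDF and type annotations:

from typing import Dict

import fitz  # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    bookmarks = {}
    with fitz.open(filepath) as doc:
        # get_toc() was called getToC() in older PyMuPDF releases
        toc = doc.get_toc()  # entries are [level, title, page]
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))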
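
Because the table of contents comes back as a flat list of `[level, title, page]` entries, the hierarchy can be recovered from the level alone. A small sketch, using hand-written sample data instead of a real PDF; the `format_toc` name and the sample entries are made up for illustration, and levels are assumed to start at 1:

```python
def format_toc(toc, indent=4):
    """Turn [level, title, page] entries into indented outline lines."""
    lines = []
    for level, title, page in toc:
        # Level 1 is the top of the hierarchy, so it gets no indentation.
        lines.append(" " * (indent * (level - 1)) + f"{title} (page {page})")
    return lines


# Hand-written sample in the same shape as the table of contents above.
sample_toc = [
    [1, "Chapter 1", 1],
    [2, "Section 1.1", 1],
    [2, "Section 1.2", 4],
    [1, "Chapter 2", 6],
]
outline_lines = format_toc(sample_toc)
print("\n".join(outline_lines))
```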


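@myrmica provided the correct answer. The function needs some additional error handling to cope with defective bookmarks. I also added 1 to the page numbers, because they are zero-based.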

import PyPDF2

def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
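            try:
                # page numbers are zero-based, so add 1 for human-readable pages
                result[reader.getDestinationPageNumber(item) + 1] = item.title
            except Exception:
                # skip bookmarks whose destination cannot be resolved
                pass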
    return result

reader = PyPDF2.PdfFileReader("[your filename]")

print(bookmark_dict(reader.getOutlines()))
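
Silently passing over a defective bookmark hides which entries were dropped. A possible variation, sketched here with stand-in objects rather than a real PDF, returns the skipped titles alongside the dictionary. `Bookmark` and `page_of` are hypothetical placeholders for PyPDF2's bookmark objects and `reader.getDestinationPageNumber`, and the defective bookmark is simulated with a `None` page.

```python
from collections import namedtuple

# Hypothetical stand-in for a PyPDF2 bookmark object (assumption for this sketch).
Bookmark = namedtuple("Bookmark", ["title", "page"])


def page_of(bookmark):
    # Stand-in lookup: fails for bookmarks without a usable destination.
    if bookmark.page is None:
        raise ValueError(f"no destination for {bookmark.title!r}")
    return bookmark.page


def bookmark_dict_with_skips(bookmark_list, page_of):
    """Return ({one_based_page: title}, [titles that could not be resolved])."""
    result, skipped = {}, []
    for item in bookmark_list:
        if isinstance(item, list):
            # Recurse into child bookmarks and merge both return values.
            sub_result, sub_skipped = bookmark_dict_with_skips(item, page_of)
            result.update(sub_result)
            skipped.extend(sub_skipped)
            continue
        try:
            result[page_of(item) + 1] = item.title
        except Exception:
            skipped.append(item.title)
    return result, skipped


outline = [Bookmark("Intro", 0), [Bookmark("Broken", None)], Bookmark("End", 9)]
found, skipped = bookmark_dict_with_skips(outline, page_of)
print(found, skipped)
```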
