Python library pdfplumber does not extract lines

python pdf

I am trying to use pdfplumber to extract text from a PDF document line by line.

I can open a page of the PDF document and view its text:

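import pdfplumber

# data_drive_os, dataloc and file are the asker's own variables (their definitions are not shown)
pdf = pdfplumber.open(data_drive_os+dataloc+'/'+ file + '.pdf')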

page = pdf.pages[0]
print(page.extract_text())

This produces the following text:

Anti-Money Laundering and 
Counter-Terrorism Financing Act 2006 
No. 169, 2006 
Compilation No. 48 
Compilation date:      20 December 2018 
Includes amendments up to:  Act No. 156, 2018 
Registered:        7 January 2019 
 
Prepared by the Office of Parliamentary Counsel, Canberra 
Authorised Version C2019C00011 registered 07/01/2019

So I know the text is there. However, when I try to extract the text line by line, it returns an empty list:

print(page.lines)

[]

I can also extract the characters:

print(page.chars)

[{'fontname': 'ABCDEE+Times New Roman', 'adv': Decimal('2.010'), 'upright': 1, 'x0': Decimal('120.500'), 'y0': Decimal('797.823'), 'x1': Decimal('122.510'), 'y1': Decimal('805.863'), 'width': Decimal('2.010'), 'height': Decimal('8.040'), 'size': Decimal('8.040'), 'object_type': 'char', 'page_number': 1, 'stroking_color': 0, 'non_stroking_color': 0, 'text': ' ', 'top': Decimal('36.057'), 'bottom': Decimal('44.097'), 'doctop': Decimal('36.057')}, {'fontname': 'ABCDEE+Times New Roman', 'adv': Decimal('6.138'), 'upright': 1, 'x0': Decimal('120.500'), 'y0': Decimal('170.315'), 'x1': Decimal('126.638'), 'y1': Decimal('181.355'), 'width': Decimal('6.138'), 'height': Decimal('11.040'), 'size': Decimal('11.040'), 'object_type': 'char', 'page_number': 1, 'stroking_color': 0, 'non_stroking_color': 0, 'text': 'P', 'top': Decimal('660.565'), 'bottom': Decimal('671.605'), 'doctop': Decimal('660.565')}, {'fontname': 'ABCDEE+Times New Roman', 'adv': Decimal('3.676'), 'upright': 1, 'x0': Decimal('126.638'), 'y0': Decimal('170.315'), 'x1': Decimal('130.315'), 'y1': Decimal('181.355'), 'width': Decimal('3.676'), 'height': Decimal('11.040'), 'size': Decimal('11.040'), 'object_type': 'char', 'page_number': 1, 'stroking_color': 0, 'non_stroking_color': 0, 'text': 'r', 'top': Decimal('660.565'), 'bottom': Decimal('671.605'), 'doctop': Decimal('660.565')}, {'fontname': 'ABCDEE+Times New Roman', 'adv': Decimal('4.902'), 'upright': 1, 'x0': Decimal('130.315'), 'y0': Decimal('170.315'), 'x1': Decimal('135.216'), 'y1': Decimal('181.355'), 'width': Decimal('4.902'), 'height': Decimal

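and so on, so there is definitely text here.

From reading the documentation page, I should be able to get the lines with .lines, but it does not work. Am I doing something wrong?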

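The answer is in the documentation you linked:

.lines, each representing a single 1-dimensional line.

This refers to geometric lines (vector elements), not lines of text. PDF has no concept of a text line (or of any higher-order grouping of characters).

If you want to detect lines of text, the best approach is probably to loop over every character on the page and watch for changes in the character metadata. pdfplumber exposes a lot of metadata, but the properties most useful to you here are probably: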

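Property    Description
y0          Distance of the bottom of the character from the bottom of the page.
y1          Distance of the top of the character from the bottom of the page.
top         Distance of the top of the character from the top of the page.
bottom      Distance of the bottom of the character from the top of the page.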
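
The character loop described above could be sketched roughly as follows. This is only an illustration, not something from the pdfplumber documentation: the function name group_chars_into_lines and the tolerance parameter are hypothetical, and it assumes each character is a dict with 'text', 'top' and 'x0' keys, as in the page.chars output shown in the question. Characters whose 'top' values lie within the tolerance of the first character on the current line are treated as belonging to the same line.

```python
def group_chars_into_lines(chars, tolerance=2):
    """Group pdfplumber-style char dicts into lines of text.

    Hypothetical helper: a new line starts whenever a character's 'top'
    differs from the first character of the current line by more than
    `tolerance` (in PDF points).
    """
    lines = []          # each entry is a list of char dicts on one line
    current = []        # chars collected for the line being built
    current_top = None  # 'top' of the first char on the current line

    # Walk the characters from the top of the page to the bottom.
    for ch in sorted(chars, key=lambda c: c["top"]):
        top = float(ch["top"])  # values may be Decimal, as in the question
        if current and abs(top - current_top) > tolerance:
            lines.append(current)
            current = []
        if not current:
            current_top = top
        current.append(ch)
    if current:
        lines.append(current)

    # Within each line, order the characters left to right and join them.
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    ]

# Possible usage with a page opened as in the question:
#   text_lines = group_chars_into_lines(page.chars)
#   for line in text_lines:
#       print(line)
```

The tolerance of 2 points is a guess; it may need adjusting for documents with superscripts, mixed font sizes, or tightly spaced lines. Multi-column layouts would also need extra handling, since this sketch only looks at vertical position.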