Python: How to extract text and text coordinates from a PDF file?

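I want to extract all the text boxes and text box coordinates from a PDF file with PDFMiner.

Many other Stack Overflow posts address how to extract all the text in an ordered fashion, but how can I do the intermediate step of getting the text together with the text locations?

Given a PDF file, the output should look something like this: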

489, 41,  "Signature"
500, 52,  "b"
630, 202, "a_g_i_r"

Newlines are converted to underscores in the final output. Here is the minimal working solution I found:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer

# Open a PDF file.
fp = open('/Users/me/Downloads/test.pdf', 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)

def parse_obj(lt_objs):

    # loop over the object list
    for obj in lt_objs:

        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))

        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):

    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()

    # extract text from this object
    parse_obj(layout._objs)

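Here is a copy-and-paste-ready example that lists the top-left corner of every block of text in a PDF. I think it should work for any PDF that does not contain "Form XObjects" with text in them: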

from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

fp = open('yourpdf.pdf', 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)

for page in pages:
    print('Processing next page...')
    interpreter.process_page(page)
    layout = device.get_result()
    for lobj in layout:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            print('At %r is text: %s' % ((x, y), text))

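The code above is based on the example in the PDFMiner documentation, plus the examples by pnj and Matt Swain. I made a couple of changes from those earlier examples:

I use PDFPage.get_pages(), which is shorthand for creating a document, checking that it is_extractable, and passing it to PDFPage.create_pages().

I don't bother handling LTFigures, since PDFMiner is currently incapable of cleanly handling text inside them.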

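LAParams lets you set some parameters that control how individual characters in the PDF get grouped into lines and text boxes by PDFMiner. If you are surprised that such grouping needs to happen at all, the reason given is:

"In an actual PDF file, text portions might be split into several chunks in the middle of its running, depending on the authoring software. Therefore, text extraction needs to splice text chunks."

Like most of PDFMiner, the parameters of LAParams are undocumented, but you can see them by calling help(LAParams) in your Python shell. The meaning of some of the parameters is given in the documentation for pdf2txt, since they can also be passed to it as command-line arguments.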

The layout object above is an LTPage, which is an iterable of "layout objects". Each of these layout objects can be one of the following types:

LTTextBox
LTFigure
LTImage
LTLine
LTRect

... or their subclasses. (In particular, your text boxes will probably all be LTTextBoxHorizontals.)
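To see which of these types actually occur on a given page, a quick tally of class names can help. The sketch below is duck-typed, so it runs on any iterable; `count_layout_types` is a hypothetical helper name and not part of PDFMiner, and the two stand-in classes exist only so the example runs without a PDF.

```python
from collections import Counter


def count_layout_types(layout):
    """Count the top-level objects of a layout by class name."""
    return Counter(type(obj).__name__ for obj in layout)


# Stand-in classes (NOT the real PDFMiner ones) so the sketch runs on its own
class LTTextBoxHorizontal:
    pass


class LTRect:
    pass


demo_layout = [LTTextBoxHorizontal(), LTRect(), LTTextBoxHorizontal()]
print(count_layout_types(demo_layout))
```

With PDFMiner you would instead call `count_layout_types(layout)` inside the page loop, right after `device.get_result()`.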
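A diagram in the documentation shows the structure of an LTPage in more detail. Of relevance to this answer: an LTPage contains the 5 types listed above, an LTTextBox contains LTTextLines plus unspecified other items, and an LTTextLine contains LTChars, LTAnnos, LTTexts, and unspecified other items.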

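Each of the types above has a .bbox property that holds an (x0, y0, x1, y1) tuple containing the coordinates of the left, bottom, right, and top of the object respectively. The y-coordinates are given as the distance from the bottom of the page. If it is more convenient for you to work with the y-axis running from top to bottom, you can subtract them from the height of the page's .mediabox: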

x0, y0_orig, x1, y1_orig = some_lobj.bbox
y0 = page.mediabox[3] - y1_orig
y1 = page.mediabox[3] - y0_orig
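The same conversion can be wrapped in a small helper. This is a minimal sketch: `flip_bbox` and `page_top` are names made up for illustration, and it assumes the lower edge of the mediabox is at 0, so that `page.mediabox[3]` equals the page height. The sample bbox is invented.

```python
def flip_bbox(bbox, page_top):
    """Convert a bottom-up (x0, y0, x1, y1) bbox to top-down y-coordinates."""
    x0, y0_orig, x1, y1_orig = bbox
    # The old top edge becomes the new (smaller) y0, and vice versa
    return (x0, page_top - y1_orig, x1, page_top - y0_orig)


# Example with a page 792 points tall (US Letter); the bbox values are made up
print(flip_bbox((489, 41, 560, 55), 792))  # (489, 737, 560, 751)
```

In the page loop this would be called as `flip_bbox(lobj.bbox, page.mediabox[3])`.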

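In addition to a bbox, LTTextBoxes also have a .get_text() method, shown above, that returns their text content as a string. Note that each LTTextBox is ultimately (via its LTTextLines) a collection of LTChars (characters explicitly drawn by the PDF, with a bbox) and LTAnnos (extra spaces that PDFMiner adds to the string representation of the text box's content when characters are drawn a long way apart; these have no bbox).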
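If you need per-character positions, you can walk down from a text box to its characters. The sketch below is duck-typed and relies only on the structure described above: iterating a text box yields lines, iterating a line yields characters, and LTAnno objects have no bbox. `char_boxes` is a hypothetical helper name, not a PDFMiner function.

```python
def char_boxes(textbox):
    """Yield (character, bbox) for each positioned character in a text box."""
    for line in textbox:          # LTTextLine objects
        for item in line:         # LTChar and LTAnno objects
            bbox = getattr(item, "bbox", None)
            if bbox is not None:  # LTAnno has no bbox, so it is skipped
                yield item.get_text(), bbox
```

Inside the `isinstance(lobj, LTTextBox)` branch of the loop above, it would be used as `for ch, bbox in char_boxes(lobj): ...`.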

The code example at the start of this answer combines these two properties to show the coordinates of each block of text.
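To produce the exact output format asked for in the question (x, y, quoted text, newlines shown as underscores), a small formatting helper may be enough. This is only a sketch: `format_record` is a hypothetical name, it uses the left and bottom coordinates as the question's code does, and it assumes the text typically ends with a newline that should not become a trailing underscore.

```python
def format_record(bbox, text):
    """Format one text block as: x, y, "text" with newlines as underscores."""
    x0, y0 = bbox[0], bbox[1]
    # Drop trailing newlines first so no trailing "_" appears
    flattened = text.rstrip("\n").replace("\n", "_")
    return '%d, %d, "%s"' % (x0, y0, flattened)


# Made-up bbox and text, mimicking a vertical run of four characters
print(format_record((630.2, 202.9, 640.0, 260.0), "a\ng\ni\nr\n"))
# 630, 202, "a_g_i_r"
```

In the loop above you would call `print(format_record(lobj.bbox, lobj.get_text()))`.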

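Finally, it is worth noting that, unlike the other Stack Overflow answers mentioned above, I don't bother recursing into LTFigures. Although LTFigures can contain text, PDFMiner does not seem able to group that text into LTTextBoxes (you can try this yourself on an example PDF) and instead produces an LTFigure that directly contains LTChar objects. In principle you could work out how to piece these together into a string, but PDFMiner (as of version 20181108) cannot do it for you.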
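For illustration only, here is one naive way such piecing together could look. It is not what PDFMiner does internally: it simply groups characters whose bottom edges are within a tolerance into one line, then reads each line left to right. `join_figure_chars` and `y_tolerance` are made-up names, no spaces are inserted between words, and rotated or overlapping text is not handled.

```python
def join_figure_chars(figure, y_tolerance=2.0):
    """Naively join positioned characters into lines of text."""
    # Keep only objects that look like LTChar: they have a bbox and get_text()
    chars = [c for c in figure if hasattr(c, "bbox") and hasattr(c, "get_text")]
    # Highest characters first, since y grows upward from the page bottom
    chars.sort(key=lambda c: -c.bbox[1])

    lines = []
    for c in chars:
        # Start a new line when the bottom edge drops by more than the tolerance
        if not lines or lines[-1][0].bbox[1] - c.bbox[1] > y_tolerance:
            lines.append([c])
        else:
            lines[-1].append(c)

    # Within each line, read left to right
    return "\n".join(
        "".join(c.get_text() for c in sorted(line, key=lambda c: c.bbox[0]))
        for line in lines
    )
```

With PDFMiner this would be called on an LTFigure instance found while iterating over the page layout.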

Hopefully, though, the PDFs you need to parse do not use Form XObjects with text in them, so this caveat will not apply to you.

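See also a duplicate question posted a few months after this one. I've left my own answer there that tweaks this one in a few ways, touching on the first device you create here and the initial setup cruft. I'm particularly curious to know: have you ever found a case where recursing into LTFigures works? My own experiments suggest that the text inside them is not grouped into text box objects by PDFMiner, so your recursion into them here will never work.

Hi, I have a PDF in invoice format. If I give the location of the text to extract, can it extract those text fields?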