Python: How to extract the x0, y0 coordinates of input fields in a PDF

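I have a PDF document and I want the coordinates of its input fields (the bottom-left corner point of each text field). Is there a way to do this with a Python library such as pyPDF2 or pdfMiner? The following image may help to understand the problem.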

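Typically, such fields are either a run of repeated periods or a run of underscores. You can extract the text lines of the PDF file with PyMuPDF and use a regular expression (`import re`) to identify such repetitions, then save the coordinates to a list or similar whenever a match is found.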
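
The regex step described above can be tried on plain strings before any PDF library is involved. The following is a minimal sketch, not the answerer's code: the pattern and the name `find_placeholder_runs` are assumptions made for illustration, and the threshold of three characters is arbitrary.

```python
import re

# Assumed pattern: three or more underscores, periods, or ellipsis
# characters in a row. Mixed runs such as "_._" also count.
PLACEHOLDER_RUN = re.compile(r"[_.…]{3,}")


def find_placeholder_runs(line):
    """Return (start, end, text) for every placeholder run in one text line."""
    return [(m.start(), m.end(), m.group(0))
            for m in PLACEHOLDER_RUN.finditer(line)]


if __name__ == "__main__":
    for line in ["Name: ________  Date: ........", "No fields here."]:
        print(find_placeholder_runs(line))
```

A single trailing period, as in an ordinary sentence, is not reported because the run is shorter than three characters.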

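The code below does this, except that it saves (x0, y0, x1, y1), that is, the coordinates of the bottom-left corner (x0, y0) and the top-right corner (x1, y1); you can extract whichever ones you need.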
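
Which corner (x0, y0) actually denotes depends on the library's coordinate convention, so it is worth checking before using the values. PyMuPDF's documentation describes page coordinates with the origin at the top-left and y growing downward, whereas pdfminer reports boxes with a bottom-left origin. Assuming a top-left-origin rectangle and an unrotated page whose origin sits at the page corner, the bottom-left point asked for in the question could be derived as in this sketch (`bottom_left_corner` is a hypothetical helper name):

```python
def bottom_left_corner(rect, page_height):
    """Return the lower-left corner of rect in bottom-left-origin coordinates.

    rect is (x0, y0, x1, y1) measured from the top-left of the page with y
    growing downward; page_height is in the same units (points).
    Assumes no page rotation and no crop-box offset.
    """
    x0, y0, x1, y1 = rect
    # The lower edge of the box is y1 when measured from the top,
    # so it is page_height - y1 when measured from the bottom.
    return (x0, page_height - y1)


if __name__ == "__main__":
    # A US Letter page is 792 points tall.
    print(bottom_left_corner((72, 100, 200, 112), 792))
```

With PyMuPDF, the page height would typically come from `page.rect.height`.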

    import re

    # The two methods below belong to the answerer's class (not shown in full).
    # self.doc is a document opened with fitz (PyMuPDF); self.matches is a list.

    def whichFields(self, txtline):
        # Matches a repeated "…" or a repeated two-character unit starting with ".".
        reg = re.compile(r"(…|\..)\1+")
        self.matches.append(reg.finditer(txtline))
        return self.matches

    # Uses PyMuPDF to find box coordinates of the fields in matches[];
    # returns a list of the coordinates in the order in which they
    # appear in matches[].
    def whereFields(self):
        global c
        count = 0
        for page in self.doc:
            field_areas = []
            c = self.newCanvas(count)  # helper defined elsewhere in the class
            page_num = count
            count += 1
            mts = []
            # Split the page text into lines. Older PyMuPDF releases spell this getText().
            txtlines = page.get_text("text").split("\n")
            prev_area = []
            for j in txtlines:
                mts.append(self.whichFields(j))

            # These for loops access the result of the regex search and ultimately pass
            # the matching strings to search_for() (searchFor() in older releases), which
            # returns a list of the rectangles in which the searched "fields" are found.
            for data in mts:
                for match in data:
                    for i in match:
                        # i[1] is capture group 1, the repeated unit of the match;
                        # search the page for its rect coordinates.
                        self.areas = page.search_for(i[1])
                        for area in self.areas:
                            field_areas.append(area)
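
As a possible alternative to searching the page for each matched string, the rectangles can be read straight from word-level extraction. This is a different technique from the answer's code, shown only as a sketch: it assumes word tuples shaped like those returned by PyMuPDF's `page.get_text("words")`, namely (x0, y0, x1, y1, text, block_no, line_no, word_no). The pattern and the name `field_rects_from_words` are hypothetical.

```python
import re

# Assumed heuristic: a word containing three or more underscores,
# periods, or ellipsis characters in a row is treated as a field.
FIELD_RUN = re.compile(r"[_.…]{3,}")


def field_rects_from_words(words):
    """Pick the rectangles of words that look like fill-in fields.

    Each item of `words` is expected to look like
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    rects = []
    for x0, y0, x1, y1, text, *_ in words:
        if FIELD_RUN.search(text):
            # The rectangle covers the whole word, including any label
            # glued to the run, as in "Name:____".
            rects.append((x0, y0, x1, y1))
    return rects


if __name__ == "__main__":
    sample = [
        (72.0, 100.0, 110.0, 112.0, "Name:", 0, 0, 0),
        (115.0, 100.0, 250.0, 112.0, "__________", 0, 0, 1),
    ]
    print(field_rects_from_words(sample))
```

With PyMuPDF installed, `words` would come from calling `page.get_text("words")` on each page of the opened document.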