Python: how to extract the x0, y0 coordinates of input fields in a PDF
I have a PDF document and I want the coordinates of its input fields (the bottom-left corner of each text field). Is there a way to do this with a Python library such as PyPDF2 or pdfminer? The image below may help illustrate the problem.
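Usually these fields are rendered as a run of repeated periods or as underscores. You can use PyMuPDF to extract the text lines of the PDF, use a regular expression (`import re`) to identify such runs, and save the coordinates to a list or similar whenever a match is found.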
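The `whichFields` and `whereFields` methods below implement this approach, except that they save the full rectangle (x0, y0, x1, y1) for each field rather than a single point, so you can pull out whichever coordinates you need. Note that PyMuPDF measures y downward from the top of the page, so (x0, y0) is the top-left corner of the rectangle and (x1, y1) the bottom-right; the bottom-left corner of a field is therefore (x0, y1) in PyMuPDF's coordinates.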
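The matching step can be tried on its own before PyMuPDF is involved. The following is a minimal sketch using only the standard library; the helper name `find_field_runs` and the minimum run length of three characters are assumptions made for illustration, not part of the original answer.

```python
import re

# Assumed pattern: three or more periods, three or more underscores,
# or two or more "…" characters in a row.
FIELD_RUN = re.compile(r"\.{3,}|_{3,}|…{2,}")


def find_field_runs(line):
    """Return (start, end, text) for every field-like run in one text line."""
    return [(m.start(), m.end(), m.group(0)) for m in FIELD_RUN.finditer(line)]


if __name__ == "__main__":
    for run in find_field_runs("Name: ________ Date: ....."):
        print(run)
```

The character offsets only locate the run inside the extracted text line; the page coordinates still have to come from a PDF library.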

    import re
    import fitz  # PyMuPDF

    class FieldFinder:
        # Minimal wrapper class (assumed); the answer shows only the two methods.
        def __init__(self, path):
            self.doc = fitz.open(path)

        def whichFields(self, txtline):
            # A field is a unit repeated at least twice, where the unit is "…",
            # a period plus one more character (".." or ". "), or an underscore.
            reg = re.compile(r"(…|\..|_)\1+")
            return [m.group(0) for m in reg.finditer(txtline)]

        # Uses PyMuPDF to find the box coordinates of the fields.
        # Returns one list of rectangles per page, in the order in which
        # the fields appear in the page text.
        def whereFields(self):
            pages = []
            for page in self.doc:
                field_areas = []
                seen = set()
                # Split the page text into lines. Older PyMuPDF releases call
                # get_text()/search_for() getText()/searchFor().
                txtlines = page.get_text("text").split("\n")
                for line in txtlines:
                    for field in self.whichFields(line):
                        # Search for the whole matched string; search_for()
                        # returns the rectangles in which it is found.
                        for area in page.search_for(field):
                            key = tuple(area)
                            if key not in seen:  # skip repeated hits
                                seen.add(key)
                                field_areas.append(area)
                pages.append(field_areas)
            return pages
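If the bottom-left corner is needed in the usual PDF convention (origin at the bottom-left of the page, y increasing upward), the rectangles collected above have to be flipped vertically. Here is a small sketch of that conversion; the helper name `bottom_left` is hypothetical, and it assumes an unrotated page whose visible area starts at (0, 0).

```python
def bottom_left(rect, page_height):
    """Convert a top-left-origin rectangle to a bottom-left-origin corner point.

    rect is (x0, y0, x1, y1) with y measured downward from the top of the
    page, as PyMuPDF reports it. Assumes no page rotation and no offset.
    """
    x0, y0, x1, y1 = rect
    # The lower edge of the rectangle is y1 in top-down coordinates,
    # so its distance from the bottom of the page is page_height - y1.
    return (x0, page_height - y1)


if __name__ == "__main__":
    # A field on a US Letter page (792 points tall).
    print(bottom_left((72, 100, 200, 112), 792))
```

With PyMuPDF the page height should be available as `page.rect.height`, so a call would look like `bottom_left(area, page.rect.height)` for each rectangle in a page's list.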