How can I extract tables as text from a PDF using Python?
I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF. Right now I find the tables manually, page by page. From there, I capture that page and save it to another PDF:
import PyPDF2
PDFfilename = "Sammamish.pdf" #filename of your PDF/directory where your PDF is stored
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object
pg4 = pfr.getPage(126) #extract pg 127
writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
#add pages
writer.addPage(pg4)
NewPDFfilename = "allTables.pdf" #filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream) #write pages to new PDF
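My goal is to extract the tables from the entire PDF document.
- I suggest you use tabula to extract the tables.
- Pass the PDF as an argument to the tabula API and it returns the tables as DataFrames.
- Each table in the PDF is returned as one DataFrame.
- The tables come back in a list of DataFrames, so you can pick out and process the DataFrame you need.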
import pandas as pd
import tabula
file = "filename.pdf"
path = 'enter your directory path here' + file
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
print(df)
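Since the goal in the question is the whole document rather than a single page, the same call can be pointed at every page. The following is only a sketch: it assumes tabula-py (with a Java runtime) and pandas are installed, and the helper names `extract_all_tables` and `save_tables_as_csv`, the file name, and the output directory are placeholders made up for illustration.

```python
import os


def extract_all_tables(pdf_path):
    """Return every table tabula finds in the PDF as a list of DataFrames."""
    import tabula  # imported lazily: requires tabula-py and a Java runtime
    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)


def save_tables_as_csv(tables, out_dir):
    """Write each table to out_dir/table_<n>.csv and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, table in enumerate(tables, start=1):
        path = os.path.join(out_dir, f"table_{i}.csv")
        table.to_csv(path, index=False)  # pandas DataFrame.to_csv
        paths.append(path)
    return paths


if __name__ == "__main__":
    pdf = "filename.pdf"  # placeholder file name
    if os.path.exists(pdf):
        tables = extract_all_tables(pdf)
        print(save_tables_as_csv(tables, "tables_out"))
```

Writing one CSV per table keeps the text output easy to inspect; tables that span several pages would still need to be joined by hand.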
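For more details, see this post of mine.
This answer is for anyone who runs into PDFs that contain images and needs to use OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed. Here are the steps I found to work:
- Use pdfimages to convert the PDF pages to images.
- Use mogrify to fix the rotation.
Some of the steps rely on external tools such as pdfimages and tesseract. I will give brief examples for two of the steps that do need code.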
def cell_in_same_row(c1, c2):
    # Each cell is an (x, y, w, h) bounding box.
    # c1 is in c2's row if c1's vertical center lies within c2's vertical extent.
    c1_center = c1[1] + c1[3] - c1[3] / 2
    c2_bottom = c2[1] + c2[3]
    c2_top = c2[1]
    return c2_top < c1_center < c2_bottom

# `cells` is assumed to be the list of (x, y, w, h) boxes found in an earlier step.
orig_cells = [c for c in cells]
rows = []
while cells:
    first = cells[0]
    rest = cells[1:]
    cells_in_same_row = sorted(
        [
            c for c in rest
            if cell_in_same_row(c, first)
        ],
        key=lambda c: c[0]
    )

    # Order the cells of one row from left to right.
    row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    rows.append(row_cells)
    cells = [
        c for c in rest
        if not cell_in_same_row(c, first)
    ]

# Sort rows by average height of their center.
def avg_height_of_center(row):
    centers = [y + h - h / 2 for x, y, w, h in row]
    return sum(centers) / len(centers)

rows.sort(key=avg_height_of_center)
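To make the row-grouping step easier to try in isolation, here is a self-contained sketch of the same idea wrapped in a function. The name `group_cells_into_rows` and the sample boxes are made up for illustration; real boxes would come from whatever cell-detection step you use.

```python
def cell_in_same_row(c1, c2):
    """True when the vertical center of c1 lies inside c2's vertical extent."""
    c1_center = c1[1] + c1[3] / 2
    return c2[1] < c1_center < c2[1] + c2[3]


def group_cells_into_rows(cells):
    """Group (x, y, w, h) boxes into rows, top to bottom and left to right."""
    remaining = list(cells)  # work on a copy so the caller's list is untouched
    rows = []
    while remaining:
        first, rest = remaining[0], remaining[1:]
        same_row = [c for c in rest if cell_in_same_row(c, first)]
        rows.append(sorted([first] + same_row, key=lambda c: c[0]))
        remaining = [c for c in rest if not cell_in_same_row(c, first)]
    # Order the rows by the average vertical center of their cells.
    rows.sort(key=lambda row: sum(y + h / 2 for _, y, _, h in row) / len(row))
    return rows


# Hypothetical boxes for a 2x2 table, deliberately out of order.
cells = [(110, 52, 90, 30), (10, 10, 90, 30), (10, 50, 90, 30), (110, 12, 90, 30)]
rows = group_cells_into_rows(cells)
print(rows)
# [[(10, 10, 90, 30), (110, 12, 90, 30)], [(10, 50, 90, 30), (110, 52, 90, 30)]]
```

Comparing centers against the other cell's extent, rather than comparing exact y values, tolerates the few pixels of jitter that detected boxes usually have.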
Another option is camelot:
import camelot
tables = camelot.read_pdf('foo.pdf')
You can then choose how to save the tables (as csv, json, excel, html, or sqlite) and whether the output should be compressed into a ZIP archive.
tables.export('foo.csv',f='csv',compress=False)
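Edit: tabula-py appears to be roughly 6 times faster than camelot-py, so it should be used instead.
import camelot
import cProfile
import pstats
import tabula

# Profile one lattice-mode read of the same page with each library.
cmd_tabula = "tabula.read_pdf('table.pdf', pages='1', lattice=True)"
prof_tabula = cProfile.Profile().run(cmd_tabula)
time_tabula = pstats.Stats(prof_tabula).total_tt

cmd_camelot = "camelot.read_pdf('table.pdf', pages='1', flavor='lattice')"
prof_camelot = cProfile.Profile().run(cmd_camelot)
time_camelot = pstats.Stats(prof_camelot).total_tt

# tabula time, camelot time, and the camelot/tabula ratio
print(time_tabula, time_camelot, time_camelot/time_tabula)
gives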
1.84955598900000015 11.057014036000016 5.978199147125147
This only works for text-based PDFs, not for scanned ones. For scanned PDFs, use image-processing techniques; there are plenty of tools for that.
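If you want to repeat the timing comparison above on your own files, the cProfile and pstats pattern can be wrapped in a small helper. This is only a sketch using the standard library; `profiled_seconds` and `slow_sum` are hypothetical names, and `slow_sum` merely stands in for a call such as `tabula.read_pdf` or `camelot.read_pdf`.

```python
import cProfile
import pstats


def profiled_seconds(func, *args, **kwargs):
    """Run func under cProfile and return the total time the profiler recorded."""
    prof = cProfile.Profile()
    prof.runcall(func, *args, **kwargs)
    return pstats.Stats(prof).total_tt


def slow_sum(n):
    # Stand-in workload; replace it with the PDF-reading call you want to time.
    return sum(i * i for i in range(n))


t_small = profiled_seconds(slow_sum, 1_000)
t_large = profiled_seconds(slow_sum, 200_000)
print(t_small, t_large)
```

A single profiled run is noisy, so it is worth repeating the measurement a few times before drawing conclusions about which library is faster.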