Python PDFMiner: reading rows instead of columns


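I found some code for extracting data from a PDF. Judging by the output, though, it extracts the text column by column. Is there a way to make pdfminer.six read the data row by row?

Here is the code I am using (lightly modified from the original, with comments removed, for readability):

Thanks in advance.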

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer


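# Open the PDF in binary mode and hand it to the parser.
fp = open('test.pdf', 'rb')

parser = PDFParser(fp)

document = PDFDocument(parser)

# Stop early if the document forbids text extraction.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# The resource manager caches shared resources such as fonts.
rsrcmgr = PDFResourceManager()

# Layout analysis parameters (defaults).
laparams = LAParams()

# The aggregator device collects the layout objects of each page.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

interpreter = PDFPageInterpreter(rsrcmgr, device)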

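def parse_obj(lt_objs):
    # Walk the layout tree and print the text of every horizontal text box.
    for obj in lt_objs:
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print(obj.get_text().replace("\n", ""))
        elif isinstance(obj, pdfminer.layout.LTFigure):
            # Figures can contain nested layout objects, so recurse into them.
            parse_obj(obj)

for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()

    # Layout containers are iterable, so the private _objs list is not needed.
    parse_obj(layout)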
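
One possible way to get row-wise output, offered as a sketch rather than a confirmed answer: pdfminer groups text into boxes, and in a tabular PDF each column often ends up as its own box, which would explain the column-by-column output. Instead of printing whole boxes, the individual text lines can be collected with their coordinates and regrouped by vertical position. The helper names `group_into_rows`, `iter_text_lines`, and `rows_from_pdf`, and the `y_tolerance` value of 2 points, are assumptions made for illustration and are not part of pdfminer.six. The sketch uses `extract_pages` from `pdfminer.high_level`, which is available in recent pdfminer.six releases.

```python
def group_into_rows(items, y_tolerance=2.0):
    """Group (x0, y, text) tuples into rows.

    Items whose y coordinate lies within y_tolerance of the first item
    of the current row are treated as the same row. Rows are returned
    top to bottom (PDF y grows upward), cells left to right.
    """
    rows = []  # each entry: [reference_y, [(x0, text), ...]]
    for x0, y, text in sorted(items, key=lambda item: -item[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append([y, [(x0, text)]])
    return [
        [text for _, text in sorted(cells, key=lambda cell: cell[0])]
        for _, cells in rows
    ]


def iter_text_lines(layout_obj):
    """Yield (x0, y0, text) for every text line under a layout object."""
    # Imported lazily so group_into_rows works without pdfminer installed.
    from pdfminer.layout import LTContainer, LTTextLine

    for obj in layout_obj:
        if isinstance(obj, LTTextLine):
            text = obj.get_text().strip()
            if text:
                yield obj.x0, obj.y0, text
        elif isinstance(obj, LTContainer):
            # Text boxes, figures, and pages are containers: recurse.
            yield from iter_text_lines(obj)


def rows_from_pdf(path):
    """Yield, for each page, a list of rows (each a list of cell texts)."""
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(path):
        yield group_into_rows(iter_text_lines(page_layout))
```

With the file from the question, `for page_rows in rows_from_pdf('test.pdf'): ...` would then iterate over the rows of each page, and each row could be printed with `" ".join(row)`. The tolerance may need tuning per document, since cells in the same visual row do not always share an exact baseline.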