python-docx: getting the tables that belong to each paragraph


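I have a .docx file that contains many paragraphs and tables, like this:

Par1

table1
table2
table3

Par2

table1
table2

2.1 Par21

table1
table2

I need to iterate over all of these objects and build a dictionary, possibly in JSON format, like: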

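{par1: [table1, table2, table3], par2: [table1, table2, {par21: [table1, table2]}]}

This is what I have so far:

from docx.api import Document

filename = 'test.docx'
document = Document(docx=filename)

for table in document.tables:
    print(table)

for paragraph in document.paragraphs:
    print(paragraph.text)

How can I relate each paragraph to its tables?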

Can you suggest anything?

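The python-docx library does not implement this yet, but there is a workaround that iterates over every element of the document in the order in which it appears.

You can loop over all of them, check whether each object is an instance of Table or Paragraph, and build your logic on that.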

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, _Row, Table  # _Row was used below but never imported
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    # Wrap each child XML element in the matching python-docx proxy object.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


document = Document('test.docx')
for block in iter_block_items(document):
    # print(block.text if isinstance(block, Paragraph) else '<table>')
    if isinstance(block, Paragraph):
        print(block.text)
    elif isinstance(block, Table):
        # Print every table row as one tab-separated line.
        for row in block.rows:
            row_data = []
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    row_data.append(paragraph.text)
            print("\t".join(row_data))
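Once the blocks come out in document order, the nested dictionary from the question can be built with a small stack. The following is only a sketch under stated assumptions, not part of python-docx: the function name `nest_tables` and the `(kind, level, value)` tuple layout are made up for illustration, and it assumes you can tell which paragraphs are headings and at what level. It uses plain Python data so it runs without a .docx file.

```python
import json


def nest_tables(blocks):
    """Group tables under the heading they follow, nesting sub-headings.

    *blocks* is an iterable of (kind, level, value) tuples in document order
    (a hypothetical layout chosen for this sketch):
      ("heading", 1, "Par1")   -> a heading paragraph and its level
      ("table", None, payload) -> a table; payload is whatever you extracted
    """
    root = {}
    stack = []  # (level, items) for each currently open heading
    for kind, level, value in blocks:
        if kind == "heading":
            # Close headings at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            items = []
            if stack:
                # Sub-heading: nest a one-key dict inside the parent's list.
                stack[-1][1].append({value: items})
            else:
                root[value] = items
            stack.append((level, items))
        elif kind == "table" and stack:
            # Tables that appear before the first heading are skipped.
            stack[-1][1].append(value)
    return root


# Stand-in data mirroring the layout described in the question.
blocks = [
    ("heading", 1, "par1"),
    ("table", None, "table1"),
    ("table", None, "table2"),
    ("table", None, "table3"),
    ("heading", 1, "par2"),
    ("table", None, "table1"),
    ("table", None, "table2"),
    ("heading", 2, "par21"),
    ("table", None, "table1"),
    ("table", None, "table2"),
]
result = nest_tables(blocks)
print(json.dumps(result))
```

To feed it from a real document, you would turn each block yielded by `iter_block_items` into one of these tuples. One possible way to get the level, assuming the file uses the built-in heading styles, is to read it from `block.style.name` (for example `Heading 2`); the table payload could be a list of rows built from `cell.text`.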


Not sure whether this helps, but here is how I do it:

def printTables(doc):
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    print(paragraph.text)
                printTables(cell)

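I tried this and not all of the text shows up; only the column names appear. Is that what it is supposed to output, or is the docx not being opened correctly on my system? I would like to learn more about this code and how to solve my problem.

Do you know whether this function could be refactored to find InlineShapes as well?

I would like to point future readers to this issue, since this is the API

@ImaneE. Probably too late, but your problem may be related to printing the table headers. Suggestion: you could try collecting the paragraph text into a list and dropping the first element, which is most likely the header. If you get it up and running, please share!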
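The last suggestion above (collect the text into a list, then drop the first element) could look roughly like this. It is a sketch on plain lists rather than real python-docx objects; the helper name `split_header` and the sample rows are invented for illustration, and it assumes the first row of the table really is the header.

```python
def split_header(rows):
    """Return (header, body), assuming the first row holds the column names."""
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Stand-in for text collected from a table, one inner list per row.
rows = [
    ["Name", "Qty"],
    ["apple", "3"],
    ["pear", "5"],
]
header, body = split_header(rows)

# Pair each data row with the column names.
records = [dict(zip(header, row)) for row in body]
print(records)
```

Keeping the header instead of discarding it lets each data row be labelled with its column names, which is convenient if the goal is JSON output.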
