Trying to parse a docx document in XML format with Python to print the bold words

python xml

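I have a Word docx file and I want to print the words that are in bold. Looking at the document in XML format, the words I want to print seem to have the following attributes: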

<w:r w:rsidRPr="00510F21">
  <w:rPr><w:b/>
     <w:noProof/>
     <w:sz w:val="22"/>
     <w:szCs w:val="22"/>
  </w:rPr>
  <w:t>Print this Sentence</w:t>
</w:r>

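After doing some research and trying to do this with the python-docx library, I decided to try lxml instead. I get an error about namespaces, and when I tried adding the namespace, the query returned an empty set. Here is some of the namespace content from the document: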

<w:document
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" 
xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" 
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
xmlns:mv="urn:schemas-microsoft-com:mac:vml" 
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" 
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"  xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" 
xmlns:w10="urn:schemas-microsoft-com:office:word" 
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" 
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"            xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 w15 wp14">

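Consider lxml's xpath() method. Recall that .get() retrieves attributes while .find() retrieves nodes. Because the attribute in this XML is namespaced, you need to prefix its name with the namespace URI in the .get() call. Finally, use the .nsmap object to retrieve all the namespaces declared on the document root: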

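from lxml import etree

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

doc = etree.parse("document.xml")
root = doc.getroot()

# Visit every run (w:r) and keep those whose w:rsidRPr attribute matches.
for run in doc.xpath('//w:r', namespaces=root.nsmap):
    if run.get('{%s}rsidRPr' % W_NS) == '00510F21':
        text = run.find('w:t', namespaces=root.nsmap)
        if text is not None:
            print(text.text)

# For the run shown in the question, this prints:
# Print this Sentence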
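To see the two lookups in isolation, here is a minimal self-contained sketch. It parses a small inline string (a hypothetical, trimmed stand-in for document.xml that declares only the w namespace) instead of a real file, and it also reproduces the kind of error an unmapped prefix causes. The names W_NS, SAMPLE and prefix_error are made up for illustration.

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical stand-in for document.xml, reduced to a single run.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r w:rsidRPr="00510F21"><w:rPr><w:b/></w:rPr>'
    '<w:t>Print this Sentence</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)

root = etree.fromstring(SAMPLE)

# Without a prefix mapping, XPath cannot resolve "w:" and raises an error.
try:
    root.xpath('//w:r')
    prefix_error = None
except etree.XPathEvalError as exc:
    prefix_error = str(exc)

# With the mapping taken from the root element, the same query works.
runs = root.xpath('//w:r', namespaces=root.nsmap)

# Attributes are looked up in Clark notation ({uri}name);
# child elements are looked up with the prefix map.
rsid = runs[0].get('{%s}rsidRPr' % W_NS)
text = runs[0].find('w:t', namespaces=root.nsmap).text

print(prefix_error)
print(rsid, text)
```

Note that runs[0].get('w:rsidRPr') would return None here, because .get() does not expand prefixes; that is why the URI is spelled out in braces.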

If you want to find all bold text, you can use findall() with an XPath expression:

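from lxml import etree

# Map the w: prefix to the WordprocessingML namespace.
namespaces = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

root = etree.parse('document.xml').getroot()

# From each w:b inside a run's w:rPr, step back up to the run and take its w:t.
for e in root.findall('.//w:r/w:rPr/w:b/../../w:t', namespaces):
    print(e.text)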

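Rather than looking for w:r nodes with the attribute w:rsidRPr="00510F21" (which I don't believe indicates bold text), look for run nodes (w:r) that have a w:b inside their run properties tag (w:rPr), and then access the text tag (w:t) inside those runs.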

The w:b tag is the bold property.
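The approach described above can be sketched as a small helper. The sample XML is a hypothetical stand-in for document.xml, and the names bold_texts and OFF_VALUES are made up for illustration. One assumption goes beyond the original answer: WordprocessingML lets w:b carry a w:val attribute, and values such as "0" or "false" switch bold off, so the sketch skips those runs. It only looks at a run's own properties and does not account for bold inherited from styles.

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Assumption: w:val values that turn the bold toggle off.
OFF_VALUES = {"0", "false", "off"}

# Hypothetical stand-in for document.xml with four runs.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:t>plain</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>bold one</w:t></w:r>'
    '<w:r><w:rPr><w:b w:val="0"/></w:rPr><w:t>not bold</w:t></w:r>'
    '<w:r><w:rPr><w:b w:val="true"/></w:rPr><w:t>bold two</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)


def bold_texts(root):
    """Return the text of every run whose own properties turn bold on."""
    results = []
    for run in root.iterfind('.//w:r', NS):
        bold = run.find('w:rPr/w:b', NS)
        if bold is None:
            continue  # no bold property on this run
        if bold.get('{%s}val' % W_NS, 'true') in OFF_VALUES:
            continue  # bold explicitly switched off
        results.extend(t.text or '' for t in run.findall('w:t', NS))
    return results


found = bold_texts(etree.fromstring(SAMPLE))
print(found)  # ['bold one', 'bold two']
```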

The XPath expression can be simplified to './/w:b/../../w:t', although this is less strict and might produce false matches.
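For comparison, lxml's xpath() method supports full XPath 1.0, so the same intent can also be written with a predicate instead of stepping back up with "..". The following is only a sketch over a made-up sample (SAMPLE is a hypothetical stand-in for document.xml), not part of the original answer; on this sample all three expressions select the same text node.

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical stand-in for document.xml: one plain run and one bold run.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:t>plain</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)
root = etree.fromstring(SAMPLE)

# Strict form: w:b must sit in w:rPr directly under w:r.
strict = [t.text for t in root.findall('.//w:r/w:rPr/w:b/../../w:t', NS)]

# Simplified form: any w:b whose grandparent has a w:t child.
loose = [t.text for t in root.findall('.//w:b/../../w:t', NS)]

# XPath 1.0 predicate: w:t children of runs that contain w:rPr/w:b.
predicate = [t.text for t in root.xpath('//w:r[w:rPr/w:b]/w:t', namespaces=NS)]

print(strict, loose, predicate)
```

The predicate form states the condition on w:r directly, which some readers find easier to follow than the "../.." steps.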

Why not use python-docx? For one thing, python-docx cannot yet handle tracked-change insertions and deletions.
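If you stay with lxml, you do not have to unzip the file by hand first. A .docx file is a ZIP archive, and the main document part is conventionally stored at word/document.xml (an assumption that holds for files written by Word, though the format technically locates it through relationships). A minimal sketch, with read_main_part as a made-up helper name and an in-memory archive standing in for a real file:

```python
import io
import zipfile

from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_main_part(docx_file):
    """Parse word/document.xml from a .docx given as a path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        # Assumption: the main part lives at the conventional location.
        return etree.fromstring(archive.read('word/document.xml'))


# Build a tiny in-memory archive so the sketch runs without a real file.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Print this Sentence</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', SAMPLE)
buffer.seek(0)

root = read_main_part(buffer)
bold = [t.text for t in root.findall('.//w:r/w:rPr/w:b/../../w:t', NS)]
print(bold)  # ['Print this Sentence']
```

With a real file, the call would simply be read_main_part('yourfile.docx'), where the file name is a placeholder.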