Python 如何确保xml.dom.minidom能够解析自己的输出？_Python_Xml_Dom_Escaping

Python 如何确保xml.dom.minidom能够解析自己的输出？

python xml dom

Python 如何确保xml.dom.minidom能够解析自己的输出？,python,xml,dom,escaping,Python,Xml,Dom,Escaping,我试图以一种可以读回的方式将一些数据序列化为xml。为此，我通过xml.DOM.minidom手动构建DOM，并使用附带的writexml方法将其写入文件我特别感兴趣的是如何构建文本节点。我通过初始化一个文本对象，然后设置它的数据属性来实现这一点。我不确定文本对象为什么不在构造函数中获取其内容，但这只是它在xml.dom.minidom中简化的方式举个具体的例子，代码如下所示： import xml.dom.minidom as dom e = dom.Element('node') t =

我试图以一种可以读回的方式将一些数据序列化为xml。为此，我通过xml.DOM.minidom手动构建DOM，并使用附带的writexml方法将其写入文件

我特别感兴趣的是如何构建文本节点。我通过初始化一个文本对象，然后设置它的数据属性来实现这一点。我不确定文本对象为什么不在构造函数中获取其内容，但这只是它在xml.dom.minidom中简化的方式

举个具体的例子，代码如下所示：

import xml.dom.minidom as dom
e = dom.Element('node')
t = dom.Text()
t.data = "The text content"
e.appendChild(t)
dom.parseString(e.toxml())

def createTextNode(self, data):
    if not isinstance(data, StringTypes):
        raise TypeError, "node contents must be a string"
    t = Text()
    t.data = data
    t.ownerDocument = self
    return t

这对我来说似乎是合理的，特别是因为createTextNode本身的实现方式与此完全相同：

import xml.dom.minidom as dom
e = dom.Element('node')
t = dom.Text()
t.data = "The text content"
e.appendChild(t)
dom.parseString(e.toxml())

def createTextNode(self, data):
    if not isinstance(data, StringTypes):
        raise TypeError, "node contents must be a string"
    t = Text()
    t.data = data
    t.ownerDocument = self
    return t

问题是，这样设置数据允许我们编写以后无法解析回的文本。举个例子，我对以下角色有困难：

you´ll

报价单为ord（180），“\xb4”。我的问题是，将这些数据编码成xml文档的正确过程是什么？我使用minidom解析文档以恢复原始树？

正如Python中所解释的，您遇到的问题是Unicode编码：

Node.toxml([encoding])
Return the XML that the DOM represents as a string.

With no argument, the XML header does not specify an encoding, and the result is
Unicode string if the default encoding cannot represent all characters in the 
document. Encoding this string in an encoding other than UTF-8 is likely
incorrect, since UTF-8 is the default encoding of XML.

With an explicit encoding [1] argument, the result is a byte string in the 
specified encoding. It is recommended that this argument is always specified.
To avoid UnicodeError exceptions in case of unrepresentable text data, the 
encoding argument should be specified as “utf-8”.

因此，调用

.toxml（'utf8'）

，而不仅仅是

.toxml（）

，并使用unicode字符串作为文本内容，您应该可以按照自己的意愿进行“往返”。例如：

>>> t.data = u"The text\u0180content"
>>> dom.parseString(e.toxml('utf8')).toxml('utf8')
'<?xml version="1.0" encoding="utf8"?><node>The text\xc6\x80content</node>'
>>>

>t.data=u“文本\u0180content”
>>>parseString（e.toxml（'utf8'））.toxml（'utf8'））
'文本\xc6\x80内容'
>>>

正如Python中所解释的，您遇到的问题是Unicode编码：

Node.toxml([encoding])
Return the XML that the DOM represents as a string.

With no argument, the XML header does not specify an encoding, and the result is
Unicode string if the default encoding cannot represent all characters in the 
document. Encoding this string in an encoding other than UTF-8 is likely
incorrect, since UTF-8 is the default encoding of XML.

With an explicit encoding [1] argument, the result is a byte string in the 
specified encoding. It is recommended that this argument is always specified.
To avoid UnicodeError exceptions in case of unrepresentable text data, the 
encoding argument should be specified as “utf-8”.

因此，调用

.toxml（'utf8'）

，而不仅仅是

.toxml（）

，并使用unicode字符串作为文本内容，您应该可以按照自己的意愿进行“往返”。例如：

>>> t.data = u"The text\u0180content"
>>> dom.parseString(e.toxml('utf8')).toxml('utf8')
'<?xml version="1.0" encoding="utf8"?><node>The text\xc6\x80content</node>'
>>>

>t.data=u“文本\u0180content”
>>>parseString（e.toxml（'utf8'））.toxml（'utf8'））
'文本\xc6\x80内容'
>>>