Replacing XML content with Etree in Python
I want to parse and compare two XML files with the Python Etree parser, as follows: I have two XML files containing a lot of data. One is in English (the source file), the other is the corresponding French translation (the target file). For example, the source file:
<AB>
  <CD/>
  <EF>
    <GH>
      <id>123</id>
      <IJ>xyz</IJ>
      <KL>DOG</KL>
      <MN>dogs/dog</MN>
      some more tags and info on same level
      <metadata>
        <entry>
          <cl>Translation</cl>
          <cl>English:dog/dogs</cl>
        </entry>
        <entry>
          <string>blabla</string>
          <string>blabla</string>
        </entry>
        some more strings and entries
      </metadata>
    </GH>
  </EF>
  <stuff/>
  <morestuff/>
  <otherstuff/>
  <stuffstuff/>
  <blubb/>
  <bla/>
  <blubbbla>8</blubbbla>
</AB>
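I am very much a beginner, and reading other posts about this has only confused me more. I would be grateful if someone could enlighten me :-)

There may be more details to clarify. Here is an example with some debug prints that shows the idea. It assumes that both files have exactly the same structure and that you only want to go one level below the root: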
import xml.etree.ElementTree as etree

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

# Get the root elements, as they support iteration
# through their children (direct descendants)
english_root = english_tree.getroot()
french_root = french_tree.getroot()

# Iterate through the direct descendants of the root
# elements in both trees in parallel.
for en, fr in zip(english_root, french_root):
    assert en.tag == fr.tag  # check for the same structure
    if en.tag == 'id':
        assert en.text == fr.text  # check for the same id
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # display what was replaced

etree.dump(french_tree)
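For more complex file structures, the loop over a node's direct children can be replaced by an iteration over all elements of the tree. If the files have exactly the same structure, the following code will work: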
import xml.etree.ElementTree as etree

english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

for en, fr in zip(english_tree.iter(), french_tree.iter()):
    assert en.tag == fr.tag  # check if the structure is the same
    if en.tag == 'id':
        assert en.text == fr.text  # identification must be the same
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # display the inserted text

# Write the result to the output file.
# tostring() returns bytes in Python 3, so decode before writing text.
with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
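The parallel iteration above silently stops at the shorter tree, because `zip` truncates, so a missing element at the end would go unnoticed. A minimal sketch of a check that could be run first is shown here. The helper name `first_structure_mismatch` and the inline sample documents are made up for illustration; it only compares tag names in document order and does not verify nesting depth.

```python
import xml.etree.ElementTree as etree
from itertools import zip_longest


def first_structure_mismatch(root_a, root_b):
    """Return the first pair of differing tags, or None if the sequences match."""
    # zip_longest pads the shorter sequence with None instead of truncating.
    for a, b in zip_longest(root_a.iter(), root_b.iter()):
        tag_a = a.tag if a is not None else None
        tag_b = b.tag if b is not None else None
        if tag_a != tag_b:
            return (tag_a, tag_b)
    return None


# Hypothetical sample data: the French tree lacks the <string> element.
english_root = etree.fromstring('<AB><GH><id>1</id><string>dog</string></GH></AB>')
french_root = etree.fromstring('<AB><GH><id>1</id></GH></AB>')

mismatch = first_structure_mismatch(english_root, french_root)
print(mismatch)
```

If the function returns `None`, the tag sequences agree and the `zip`-based loop visits every element of both trees.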
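However, it works only if both files have exactly the same structure. Let's follow the algorithm one would use when doing the task by hand. First, we need to find an empty French translation. Then it should be replaced by the English translation from the GH element with the same identification. A subset of XPath expressions is used when searching for the elements: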
import xml.etree.ElementTree as etree


def find_translation(tree, id_):
    # Search for the GH element with the given identification, and return
    # its translation if found. Otherwise None is returned implicitly.
    for gh in tree.iter('GH'):
        id_elem = gh.find('./id')
        if id_ == id_elem.text:
            # The related GH element found.
            # Find metadata entry, extract the translation.
            # Warning! This is a simplification for the fixed position
            # of the Translation entry.
            me = gh.find('./metadata/entry')
            assert len(me) == 2  # metadata/entry has two elements
            cl1 = me[0]
            assert cl1.text == 'Translation'
            cl2 = me[1]
            return cl2.text


# Body of the program. --------------------------------------------------
english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')

for gh in french_tree.iter('GH'):  # iterate through the GH elements only
    # Get the identification of the GH section
    id_elem = gh.find('./id')
    id_ = id_elem.text

    # Find and check the metadata entry, extract the French translation.
    # Warning! This is a simplification for the fixed position of the
    # Translation entry.
    me = gh.find('./metadata/entry')
    assert len(me) == 2  # metadata/entry has two elements
    cl1 = me[0]
    assert cl1.text == 'Translation'
    cl2 = me[1]
    fr_translation = cl2.text

    # If the French translation is empty, put there the English translation
    # from the related element.
    if cl2.text is None:
        cl2.text = find_translation(english_tree, id_)

with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
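Scanning the whole English tree once per French GH element gets slow for large files. As a rough sketch, the GH elements could be indexed by id once, or looked up with the `[tag='text']` predicate that ElementTree's XPath subset supports. The helper name `index_by_id` and the second sample record (id 456) are invented for illustration and are not part of the original files.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample data modelled on the source file in the question.
EN = '''<AB><EF>
<GH><id>123</id><metadata><entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry></metadata></GH>
<GH><id>456</id><metadata><entry><cl>Translation</cl><cl>English:cat/cats</cl></entry></metadata></GH>
</EF></AB>'''


def index_by_id(root):
    """Map the text of each GH/id element to its GH element."""
    index = {}
    for gh in root.iter('GH'):
        id_elem = gh.find('id')
        if id_elem is not None and id_elem.text:
            index[id_elem.text.strip()] = gh
    return index


english_root = etree.fromstring(EN)

# Option 1: build the dictionary once, then look up by id in constant time.
gh_by_id = index_by_id(english_root)

# Option 2: let ElementTree search with a child-text predicate.
gh_xpath = english_root.find(".//GH[id='456']")

print(gh_by_id['456'] is gh_xpath)
```

The dictionary is the better fit when many lookups are needed; the XPath form is shorter for a one-off search, but it walks the tree on every call.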
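It would be good to attach a more complex example to help check that the solution is correct.

Now it is time to use XPath (the standard xml.etree.ElementTree supports only some of its features, but they are powerful enough). Try the modified answer (the last part). Fix the names of the input/output files. Then I suggest cleaning up the comments here to make the page more readable and more useful to others.

Right… If the translation entry is not at a fixed position, could I rename the "entry" tag around the translation to something unique and find it that way, or is that not advisable? (I tried it and it did not work, but I would like to know whether it is the right direction.)

Renaming tags should generally not be done. It is better if the tag/element has its own special name to begin with. Seen that way, this is not a very good example. But I understand that the user may decide to insert that column interactively, and the underlying software cannot guess what the user wants.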
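For the case raised in the comments, where the Translation entry is not at a fixed position, one possible approach is to inspect every metadata/entry and pick the one whose first cl child reads "Translation", without renaming any tags. This is only a sketch under that assumption; the helper name `find_translation_elem` and the sample snippet (with the entries deliberately reordered) are hypothetical.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample: the Translation entry is the second entry here.
GH_XML = '''<GH><id>123</id><metadata>
<entry><string>blabla</string><string>blabla</string></entry>
<entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry>
</metadata></GH>'''


def find_translation_elem(gh):
    """Return the <cl> element holding the translation text, or None."""
    for entry in gh.findall('./metadata/entry'):
        cls = entry.findall('cl')
        # The label is assumed to be the first <cl>, the value the second.
        if len(cls) == 2 and cls[0].text == 'Translation':
            return cls[1]
    return None


gh = etree.fromstring(GH_XML)
translation = find_translation_elem(gh)
print(translation.text)
```

Returning the element rather than its text means the caller can both read the translation and assign to `.text` when the French value is empty.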