Merging child nodes under similar parent nodes (XML, Python)
I have the following XML file:
<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>
</article_date>
</root>
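I'll write as much as I have time (and knowledge) for, but I'm making this a community wiki so that others can help out.
I'd suggest using a parsing library for this. I'll be using BeautifulSoup because, for some reason, I can't get xml working right now.
First, let's get set up: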
>>> import bs4
>>> soup = bs4.BeautifulSoup('''<root>
... <article_date>09/09/2013
... <article_time>1
... <article_name>aaa1</article_name>
... <article_link>1aaaaaaa</article_link>
... </article_time>
... <article_time>0
... <article_name>aaa2</article_name>
... <article_link>2aaaaaaa</article_link>
... </article_time>
... <article_time>1
... <article_name>aaa3</article_name>
... <article_link>3aaaaaaa</article_link>
... </article_time>
... <article_time>0
... <article_name>aaa4</article_name>
... <article_link>4aaaaaaa</article_link>
... </article_time>
... <article_time>1
... <article_name>aaa5</article_name>
... <article_link>5aaaaaaa</article_link>
... </article_time>
... </root>''')
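The next thing to do is to define what makes parent nodes "similar". Let's write a key function that specifies which part of each child to look at. First, let's do some investigating to understand the structure of each child: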
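>>> children = soup.find_all('article_time')
>>> children
[<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>, <article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>, <article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>, <article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>, <article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>]
>>> children[0].contents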
[u'1\n ', <article_name>aaa1</article_name>, u'\n', <article_link>1aaaaaaa</article_link>, u'\n']
>>> children[0].contents[0]
u'1\n '
>>> int(children[0].contents[0])
1
>>> def key(child):
...     return int(child.contents[0])
...
>>> key(children[0])
1
>>> key(children[1])
0
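Now sort the children by that key and group them (itertools.groupby only groups adjacent items, which is why the sort comes first):
>>> children = sorted(children, key=key)
>>> import itertools
>>> groups = itertools.groupby(children, key)
groups is a one-shot iterator, much like a generator: it behaves like a list, but we can only iterate over it once. Let's see what it produces, even though that means we'll have to re-create it afterwards. (We get only one pass, so by looking at the data we use it up. Fortunately, it's easy to re-create.)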
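>>> for k, g in groups:
...     print k, ':\t', list(g)
...
0 : [<article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>, <article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>]
1 : [<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>, <article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>, <article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>]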
OK, so k is the key that was used to produce the group, and g is the sequence of article_time elements that match k.
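To make the roles of k and g concrete, here is a minimal sketch of how the grouping could be carried one step further to join the names in each group. It is only an illustration under stated assumptions: it uses xml.etree.ElementTree from the standard library rather than BeautifulSoup, and SAMPLE, group_children and merged are hypothetical names on a shortened copy of the data, not part of the answer above.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the question's XML.
SAMPLE = """<root><article_date>09/09/2013
<article_time>1<article_name>aaa1</article_name><article_link>1aaaaaaa</article_link></article_time>
<article_time>0<article_name>aaa2</article_name><article_link>2aaaaaaa</article_link></article_time>
<article_time>1<article_name>aaa3</article_name><article_link>3aaaaaaa</article_link></article_time>
</article_date></root>"""


def key(child):
    # The leading text of <article_time> is the value we group on.
    return int(child.text.strip())


def group_children(xml_text):
    children = ET.fromstring(xml_text).findall('.//article_time')
    # groupby only merges adjacent items, so sort by the same key first.
    children = sorted(children, key=key)
    # list(g) materializes each group so it survives the single pass.
    return {k: list(g) for k, g in itertools.groupby(children, key)}


groups = group_children(SAMPLE)
# Join the article names of every group with '+'.
merged = {
    k: '+'.join(child.find('article_name').text for child in g)
    for k, g in groups.items()
}
print(merged)
```

Storing list(g) in a dict sidesteps the one-pass limitation: the groups can be inspected and then reused without re-creating the groupby object.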
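Sorry, I've run out of time for now. Hopefully this is enough to get you started.
Here is a solution using xml.etree.ElementTree from the Python standard library.
The idea is to collect the items into a defaultdict(list), keyed by the text value of each article_time: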
from collections import defaultdict
import xml.etree.ElementTree as ET
data = """<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>
</article_date>
</root>
"""
tree = ET.fromstring(data)
root = ET.Element('root')
article_date = ET.SubElement(root, 'article_date')
article_date.text = tree.find('.//article_date').text
data = defaultdict(list)
for article_time in tree.findall('.//article_time'):
    text = article_time.text.strip()
    name = article_time.find('./article_name').text
    link = article_time.find('./article_link').text
    data[text].append((name, link))
# Build one merged <article_time> element per distinct time value.
for time_value, items in data.items():
    article_time = ET.SubElement(article_date, 'article_time')
    article_name = ET.SubElement(article_time, 'article_name')
    article_link = ET.SubElement(article_time, 'article_link')
    article_time.text = time_value
    article_name.text = '+'.join(name for (name, _) in items)
    article_link.text = '+'.join(link for (_, link) in items)
# tostring() returns a byte string; decode() makes it print cleanly on Python 2 and 3.
print(ET.tostring(root).decode())
Prints (prettified):
<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1+aaa3+aaa5</article_name>
<article_link>1aaaaaaa+3aaaaaaa+5aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2+aaa4</article_name>
<article_link>2aaaaaaa+4aaaaaaa</article_link>
</article_time>
</article_date>
</root>
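The output is described as prettified, whereas ET.tostring returns everything on a single line. One possible way to get indented output, assuming xml.dom.minidom from the standard library is acceptable, is sketched below. The small tree built here is a hypothetical stand-in for the merged result, and note that pretty-printing adds whitespace around the text of elements that also contain child elements.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the merged tree built by the solution.
root = ET.Element('root')
article_time = ET.SubElement(root, 'article_time')
article_time.text = '1'
ET.SubElement(article_time, 'article_name').text = 'aaa1+aaa3+aaa5'

raw = ET.tostring(root)  # byte string, all on one line
# Re-parse with minidom and let it add line breaks and indentation.
pretty = xml.dom.minidom.parseString(raw).toprettyxml(indent='  ')
print(pretty)
```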
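See, the result is exactly what you wanted.
Comments:
What is your code so far?
Thanks for posting the solution. However, there is one problem: in my data structure, the date tag wraps all the other tags.
@mr.M OK, but I don't see the closing date tag. Where should it go?
My bad. I've fixed the data structure.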