Merging child nodes under similar parent nodes in XML with Python

I have the following XML file:

<root>
    <article_date>09/09/2013
    <article_time>1
        <article_name>aaa1</article_name>
        <article_link>1aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa2</article_name>
        <article_link>2aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa3</article_name>
        <article_link>3aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa4</article_name>
        <article_link>4aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa5</article_name>
        <article_link>5aaaaaaa</article_link>
    </article_time>
    </article_date>
</root>

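I'll write as much as I have time (and knowledge) for, but I'm making this a community wiki so that others can help out.

I'd recommend using a dedicated parsing library for this. I'll use BeautifulSoup because, for some reason, I can't get the other option working right now.

First, let's get set up: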

>>> import bs4
>>> soup = bs4.BeautifulSoup('''<root>
...     <article_date>09/09/2013
...     <article_time>1
...         <article_name>aaa1</article_name>
...         <article_link>1aaaaaaa</article_link>
...     </article_time>
...     <article_time>0
...         <article_name>aaa2</article_name>
...         <article_link>2aaaaaaa</article_link>
...     </article_time>
...     <article_time>1
...         <article_name>aaa3</article_name>
...         <article_link>3aaaaaaa</article_link>
...     </article_time>
...     <article_time>0
...         <article_name>aaa4</article_name>
...         <article_link>4aaaaaaa</article_link>
...     </article_time>
...     <article_time>1
...         <article_name>aaa5</article_name>
...         <article_link>5aaaaaaa</article_link>
...     </article_time>
... </root>''')
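The next thing to do is define a key that says what makes parent nodes "similar". Let's write a key function that specifies which part of each child to look at. We'll do some investigation first to learn the structure of each child.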

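>>> children = soup.find_all('article_time')
>>> children[0].contents
[u'1\n        ', <article_name>aaa1</article_name>, u'\n', <article_link>1aaaaaaa</article_link>, u'\n']
>>> children[0].contents[0]
u'1\n        '
>>> int(children[0].contents[0])
1
>>> def key(child):
...     # The group key is the leading text of each <article_time>
...     return int(child.contents[0])
...
>>> key(children[0])
1
>>> key(children[1])
0
>>> # groupby only merges adjacent items, so sort by the same key first
>>> children = sorted(children, key=key)
>>> import itertools
>>> groups = itertools.groupby(children, key)
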
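groups is a generator: it is like a list, but we can only iterate over it once. Let's look at what it contains, even though that means we will have to recreate it afterwards. (A generator allows only one pass, so by inspecting the data we consume it. Fortunately, it is easy to recreate.)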

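>>> for k, g in groups:
...     print k, ':\t', list(g)
...
0 : [<article_time>0
        <article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>, <article_time>0
        <article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>]
1 : [<article_time>1
        <article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>, <article_time>1
        <article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>, <article_time>1
        <article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>]
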
OK, so k is the key used to form each group, and g is the sequence of article_time elements matching k.
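
To make the single-pass behaviour concrete, here is a small sketch in Python 3. The pairs list is a made-up stand-in for the parsed article_time elements (it is not BeautifulSoup output); the only point is that each group should be turned into a list while iterating, because the groupby object cannot be replayed.

```python
import itertools

# Hypothetical stand-ins for the parsed elements: (key, name) pairs,
# already sorted by key as groupby requires.
pairs = [(0, 'aaa2'), (0, 'aaa4'), (1, 'aaa1'), (1, 'aaa3'), (1, 'aaa5')]

groups = itertools.groupby(pairs, key=lambda pair: pair[0])

# Materialize every group during the one pass we get; each g is only
# valid until groupby advances to the next key.
grouped = {k: [name for _, name in g] for k, g in groups}
print(grouped)        # {0: ['aaa2', 'aaa4'], 1: ['aaa1', 'aaa3', 'aaa5']}

# The generator is now exhausted, so a second pass yields nothing.
print(list(groups))   # []
```

Storing the groups in a dict (or a list of lists) means they can be inspected as often as needed without recreating the generator.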


Sorry, I'm out of time for now. Hopefully this is enough to get you started.
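
One way the remaining step could look, as a rough sketch rather than a finished answer. It is written for Python 3 and uses xml.etree.ElementTree instead of BeautifulSoup so that it has no third-party dependency. The function name merge_by_key, the shortened sample data, and the choice to join values with '+' are all assumptions made for illustration.

```python
import itertools
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the question's data.
XML = """<root>
    <article_date>09/09/2013
    <article_time>1
        <article_name>aaa1</article_name>
        <article_link>1aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa2</article_name>
        <article_link>2aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa3</article_name>
        <article_link>3aaaaaaa</article_link>
    </article_time>
    </article_date>
</root>"""


def key(elem):
    # Leading text of <article_time>, e.g. "1\n        " -> 1
    return int(elem.text.strip())


def merge_by_key(xml_text):
    """Return {key: (joined names, joined links)} for each group."""
    tree = ET.fromstring(xml_text)
    # sorted() is stable, so document order is kept inside each group.
    children = sorted(tree.iter('article_time'), key=key)
    merged = {}
    for k, group in itertools.groupby(children, key):
        members = list(group)  # consume the group before groupby moves on
        names = '+'.join(m.findtext('article_name') for m in members)
        links = '+'.join(m.findtext('article_link') for m in members)
        merged[k] = (names, links)
    return merged


merged = merge_by_key(XML)
print(merged)
```

From the merged dict it is a short step to build new article_time elements, one per key.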

Here is a solution using xml.etree.ElementTree from the Python standard library.

The idea is to collect the items into a defaultdict(list), keyed by the text value of each article_time:

from collections import defaultdict
import xml.etree.ElementTree as ET

data = """<root>
    <article_date>09/09/2013
    <article_time>1
        <article_name>aaa1</article_name>
        <article_link>1aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa2</article_name>
        <article_link>2aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa3</article_name>
        <article_link>3aaaaaaa</article_link>
    </article_time>
    <article_time>0
        <article_name>aaa4</article_name>
        <article_link>4aaaaaaa</article_link>
    </article_time>
    <article_time>1
        <article_name>aaa5</article_name>
        <article_link>5aaaaaaa</article_link>
    </article_time>
    </article_date>
</root>
"""

tree = ET.fromstring(data)

root = ET.Element('root')
article_date = ET.SubElement(root, 'article_date')
article_date.text = tree.find('.//article_date').text

data = defaultdict(list)
for article_time in tree.findall('.//article_time'):
    text = article_time.text.strip()
    name = article_time.find('./article_name').text
    link = article_time.find('./article_link').text
    data[text].append((name, link))

# Emit one merged <article_time> per distinct time value
for time_value, items in data.items():
    article_time = ET.SubElement(article_date, 'article_time')
    article_name = ET.SubElement(article_time, 'article_name')
    article_link = ET.SubElement(article_time, 'article_link')

    article_time.text = time_value
    article_name.text = '+'.join(name for (name, _) in items)
    article_link.text = '+'.join(link for (_, link) in items)

print(ET.tostring(root))

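which prints (prettified):

<root>
    <article_date>09/09/2013
        <article_time>1
            <article_name>aaa1+aaa3+aaa5</article_name>
            <article_link>1aaaaaaa+3aaaaaaa+5aaaaaaa</article_link>
        </article_time>
        <article_time>0
            <article_name>aaa2+aaa4</article_name>
            <article_link>2aaaaaaa+4aaaaaaa</article_link>
        </article_time>
    </article_date>
</root>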

As you can see, the result is exactly what you wanted.
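
On Python 3 the same idea could be wrapped in a function. This is only a sketch under a few assumptions: merge_article_times and its sep parameter are hypothetical names, the sample input is a compact version of the question's data, and encoding='unicode' is passed so that ET.tostring returns a str instead of bytes.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def merge_article_times(xml_text, sep='+'):
    """Merge <article_time> elements that share the same leading text."""
    source = ET.fromstring(xml_text)
    source_date = source.find('.//article_date')

    root = ET.Element('root')
    article_date = ET.SubElement(root, 'article_date')
    article_date.text = source_date.text.strip()

    # Collect (name, link) pairs per time value; dicts keep insertion
    # order on Python 3.7+, so groups appear in first-seen order.
    grouped = defaultdict(list)
    for article_time in source_date.findall('article_time'):
        name = article_time.findtext('article_name')
        link = article_time.findtext('article_link')
        grouped[article_time.text.strip()].append((name, link))

    for time_value, items in grouped.items():
        merged = ET.SubElement(article_date, 'article_time')
        merged.text = time_value
        ET.SubElement(merged, 'article_name').text = sep.join(n for n, _ in items)
        ET.SubElement(merged, 'article_link').text = sep.join(l for _, l in items)

    return ET.tostring(root, encoding='unicode')


sample = (
    '<root><article_date>09/09/2013'
    '<article_time>1<article_name>aaa1</article_name>'
    '<article_link>1aaaaaaa</article_link></article_time>'
    '<article_time>0<article_name>aaa2</article_name>'
    '<article_link>2aaaaaaa</article_link></article_time>'
    '<article_time>1<article_name>aaa3</article_name>'
    '<article_link>3aaaaaaa</article_link></article_time>'
    '</article_date></root>'
)
print(merge_article_times(sample))
```

Returning a string keeps the function easy to test; returning the root element instead would be just as reasonable if further processing is planned.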

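What is your code so far?

Thanks for posting a solution. There is one problem, though: in my data structure, the date tag wraps all the other tags.

@mr.M OK, but I don't see the closing date tag. Where should it go?

My bad. I've fixed the data structure.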