Python 2.6: efficiently removing and counting duplicates in a list of dictionaries


I am trying to efficiently transform:

[{'text': 'hallo world', 'num': 1}, 
 {'text': 'hallo world', 'num': 2}, 
 {'text': 'hallo world', 'num': 1}, 
 {'text': 'haltlo world', 'num': 1}, 
 {'text': 'hallo world', 'num': 1}, 
 {'text': 'hallo world', 'num': 1}, 
 {'text': 'hallo world', 'num': 1}]

into a list of dictionaries with the duplicates removed and a count of how often each one occurred:

[{'text': 'hallo world', 'num': 2, 'count':1}, 
 {'text': 'hallo world', 'num': 1, 'count':5}, 
 {'text': 'haltlo world', 'num': 1, 'count':1}]

So far, I have the following to find the duplicates:

result = [dict(tupleized) for tupleized in set(tuple(item.items()) for item in li)]

It returns:

[{'text': 'hallo world', 'num': 2}, 
 {'text': 'hallo world', 'num': 1}, 
 {'text': 'haltlo world', 'num': 1}]

Thanks.

I would use one of my favorites from itertools:

from itertools import groupby

def canonicalize_dict(x):
    "Return a (key, value) list sorted by the hash of the key"
    return sorted(x.items(), key=lambda x: hash(x[0]))

def unique_and_count(lst):
    "Return a list of unique dicts with a 'count' key added"
    grouper = groupby(sorted(map(canonicalize_dict, lst)))
    return [dict(k + [("count", len(list(g)))]) for k, g in grouper]

a = [{'text': 'hallo world', 'num': 1},  
     #....
     {'text': 'hallo world', 'num': 1}]

print unique_and_count(a)

Output:

[{'count': 5, 'text': 'hallo world', 'num': 1}, 
{'count': 1, 'text': 'hallo world', 'num': 2}, 
{'count': 1, 'text': 'haltlo world', 'num': 1}]

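As gnibbler pointed out, d1.items() and d2.items() may return the keys in different orders even when the dicts are equal, so I introduced the canonicalize_dict function to work around that.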
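To illustrate the ordering issue, here is a minimal sketch that is not part of the original answer. It assumes Python 3.7+, where dicts preserve insertion order, because that makes the mismatch easy to reproduce. The helper name canonical_key is hypothetical, and it sorts by the key itself rather than by its hash, which assumes the keys are mutually comparable.

```python
def canonical_key(d):
    # Hypothetical helper: sort the items by key so equal dicts yield equal tuples.
    return tuple(sorted(d.items(), key=lambda kv: kv[0]))

d1 = {'text': 'hallo world', 'num': 1}
d2 = {'num': 1, 'text': 'hallo world'}  # same content, different insertion order

# The dicts compare equal, but their raw item tuples may not.
raw_keys = set([tuple(d1.items()), tuple(d2.items())])

# The canonical form is the same for both dicts, so a set keeps only one.
canonical_keys = set([canonical_key(d1), canonical_key(d2)])

print(len(raw_keys), len(canonical_keys))
```

On an interpreter that preserves insertion order, the first set keeps both tuples while the second collapses to a single entry.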

Note: this now uses frozenset, which means the items in the dictionaries must be hashable.

>>> from collections import defaultdict
>>> from itertools import chain
>>> data = [{'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 2},  {'text': 'hallo world', 'num': 1}, {'text': 'haltlo world', 'num': 1}, {'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 1}]
>>> c = defaultdict(int)
>>> for d in data:
...     c[frozenset(d.iteritems())] += 1
...
>>> [dict(chain(k, (('count', count),))) for k, count in c.iteritems()]
[{'count': 1, 'text': 'haltlo world', 'num': 1}, {'count': 1, 'text': 'hallo world', 'num': 2}, {'count': 5, 'text': 'hallo world', 'num': 1}]
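The same idea can be wrapped in a function. The following is a sketch rather than the answerer's code: the name unique_and_count_hashed is hypothetical, and it uses items() instead of iteritems() so that it should run on both Python 2.6 and Python 3. It assumes every value is hashable and that the input dicts have no 'count' key of their own, since that key would be overwritten.

```python
from collections import defaultdict

def unique_and_count_hashed(dicts):
    # Hypothetical name: count each distinct dict in a single pass.
    counts = defaultdict(int)
    for d in dicts:
        # A frozenset ignores item order, but every value must be hashable.
        counts[frozenset(d.items())] += 1
    result = []
    for items, n in counts.items():
        row = dict(items)   # rebuild the dict from its (key, value) pairs
        row['count'] = n    # overwrites any existing 'count' key
        result.append(row)
    return result

data = [{'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 2},
        {'text': 'hallo world', 'num': 1},
        {'text': 'haltlo world', 'num': 1},
        {'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 1}]

result = unique_and_count_hashed(data)
```

The order of the returned list is not specified, so sort it afterwards if a stable order matters.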

A simple solution that does not rely on any built-in helpers:

>>> d = [{'text': 'hallo world', 'num': 1}, 
...  {'text': 'hallo world', 'num': 2}, 
...  {'text': 'hallo world', 'num': 1}, 
...  {'text': 'haltlo world', 'num': 1}, 
...  {'text': 'hallo world', 'num': 1}, 
...  {'text': 'hallo world', 'num': 1}, 
...  {'text': 'hallo world', 'num': 1}]
>>> 
>>> def unique_counter(filesets):
...      for i in filesets:
...          i['count'] = sum([1 for j in filesets if j['num'] == i['num']])
...      return dict((k['num'], k) for k in filesets).values()
... 
>>> unique_counter(d)
[{'count': 6, 'text': 'hallo world', 'num': 1}, {'count': 1, 'text': 'hallo world', 'num': 2}]
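Note that the function above groups on 'num' alone, which is why the 'haltlo world' entry is folded into the count of 6, and that it adds 'count' to the input dicts in place. If the whole dict should be compared instead, a sketch along these lines might work. The name unique_counter_full is hypothetical, and the linear scan of the list of distinct dicts makes it slower than the hash-based answers on large inputs.

```python
def unique_counter_full(dicts):
    uniques = []  # distinct dicts, in first-seen order
    counts = []   # counts[i] is the number of occurrences of uniques[i]
    for d in dicts:
        if d in uniques:  # list membership uses dict equality
            counts[uniques.index(d)] += 1
        else:
            uniques.append(d)
            counts.append(1)
    # Build new dicts so the inputs are left unmodified.
    return [dict(u, count=c) for u, c in zip(uniques, counts)]

d = [{'text': 'hallo world', 'num': 1},
     {'text': 'hallo world', 'num': 2},
     {'text': 'hallo world', 'num': 1},
     {'text': 'haltlo world', 'num': 1},
     {'text': 'hallo world', 'num': 1},
     {'text': 'hallo world', 'num': 1},
     {'text': 'hallo world', 'num': 1}]

result = unique_counter_full(d)
```

Because it relies only on equality, this version also accepts unhashable values such as lists.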

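I would suggest collections.Counter, but the dict type is not hashable :(. If you can turn these dicts into hashable dict-like objects, Counter would work well here.

You can write your own algorithm based on sets: set('ABC') - set('ABC') == set([]).

Thanks. I am also on Python 2.6; Counter is only available in v2.7+.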
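For readers who are on Python 2.7 or later, the Counter suggestion could look roughly like the sketch below. It is an illustration rather than code from the thread: it uses frozenset(d.items()) as the hashable stand-in for each dict, which assumes all values are hashable, and it will not run on Python 2.6.

```python
from collections import Counter  # available from Python 2.7 onwards

data = [{'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 2},
        {'text': 'hallo world', 'num': 1},
        {'text': 'haltlo world', 'num': 1},
        {'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 1}]

# Each dict becomes a frozenset of its items, which Counter can hash.
counts = Counter(frozenset(d.items()) for d in data)

# most_common() yields (key, count) pairs, most frequent first.
result = [dict(items, count=n) for items, n in counts.most_common()]
```

Using most_common() puts the most frequent dict first; the relative order of entries with equal counts should not be relied upon.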

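tuple(item.items()) will not work reliably, because the order of items() is not always the same even when the dicts are equal.

@gnibbler If every dict has the same keys, isn't the order always the same?

Great answer. The only caveat is that you need to know the names of the fields. Anyway, thanks a lot. I will use @lazyr's solution above.

@tr33hous You don't need to know them, I was just being explicit. I'll change it now.

@tr33hous It no longer needs to know the fields. Also note that this solution runs in O(n), whereas lazyr's solution uses a sort, which is O(n log n). If you are dealing with large lists, you will need to take that into account.

As gnibbler pointed out in the comments on the question, d.iteritems() is not guaranteed to return the keys in the same order for all dictionaries.

Yes, I just saw the edit. Sorry for not spending more time looking at the solution. +1

Sorting works if all the keys are sortable. Some things are hashable but not sortable, for example complex numbers.

You have a small bug in unique_and_count: it should be "for x in lst" and not "for x in a".

@Zaar Thanks, I missed that when I refactored the code earlier.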
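The point about values that are hashable but not sortable can be checked with a small sketch. This is an illustration added here, not code from the thread, and the sample data is made up: it stores complex numbers as values, groups them through frozenset keys, and then attempts the sort that a sort-based approach would need.

```python
rows = [{'z': 1 + 2j}, {'z': 3 + 4j}, {'z': 1 + 2j}]

# Hashing works: complex numbers are hashable, so frozenset keys are fine.
keys = set(frozenset(r.items()) for r in rows)

# Sorting fails: complex numbers define no ordering, so comparing the
# item lists of two different rows raises TypeError.
try:
    sorted(sorted(r.items()) for r in rows)
    sort_ok = True
except TypeError:
    sort_ok = False

print(len(keys), sort_ok)
```

Here the hash-based grouping finds two distinct dicts, while the sort raises TypeError as soon as two different complex values are compared.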