Python: n largest elements in a sequence (duplicates must be kept)

I need to find the n largest elements in a list of tuples. Here is an example for the top 3 elements.

# I have a list of tuples of the form (category-1, category-2, value)
# For each category-1, ***values are already sorted descending by default***
# The list can potentially be approximately a million elements long.
lot = [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), 
       ('a', 'x4',  8), ('a', 'x5', 8), ('a', 'x6', 7),
       ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8), 
       ('b', 'x4',  7), ('b', 'x5', 6), ('b', 'x6', 5)]

# This is what I need. 
# A list of tuple with top-3 largest values for each category-1
ans = [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), 
       ('a', 'x4', 8), ('a', 'x5', 8),
       ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8)]

I tried using heapq.nlargest. However, it returns only the 3 largest elements and does not include the duplicates. For example,

heapq.nlargest(3, [10, 10, 10, 9, 8, 8, 7, 6])
# returns
[10, 10, 10]
# I need
[10, 10, 10, 9, 8, 8]
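
For a flat list, one way to get the behavior asked for here is to pick the n largest distinct values first and then keep every element equal to one of them. The following is only a sketch; the helper name nlargest_with_ties is made up for illustration and is not part of heapq.

```python
import heapq

def nlargest_with_ties(n, values):
    """Return every element whose value is among the n largest distinct values.

    nlargest_with_ties is a hypothetical helper name, not a heapq function.
    """
    # set() collapses duplicates, so nlargest picks n *distinct* values.
    top = set(heapq.nlargest(n, set(values)))
    # Keep all occurrences of those values, largest first.
    return sorted((v for v in values if v in top), reverse=True)

print(nlargest_with_ties(3, [10, 10, 10, 9, 8, 8, 7, 6]))
# [10, 10, 10, 9, 8, 8]
```

This makes two passes over the data (one to find the distinct top values, one to filter), which is probably acceptable for unsorted input but does not exploit the pre-sorted order described in the question.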
I can only think of a brute-force approach. This is what I have, and it works.

res, prev_t, count = [lot[0]], lot[0], 1
for t in lot[1:]:
    if t[0] == prev_t[0]:
        count = count + 1 if t[2] != prev_t[2] else count
        if count <= 3:
            res.append(t)   
    else:
        count = 1
        res.append(t)
    prev_t = t

print(res)
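
The same single-pass idea can be wrapped in a function that takes n as a parameter instead of hard-coding 3. This is a sketch under the question's stated assumption that the list is grouped by category-1 and sorted descending by value within each category; the function name top_n_per_category is invented here.

```python
def top_n_per_category(lot, n):
    """Keep tuples whose value is among the first n distinct values of each category.

    Assumes lot is grouped by t[0] and sorted descending by t[2] within each
    group. top_n_per_category is a hypothetical name used for illustration.
    """
    res = []
    sentinel = object()          # never equal to a real category
    prev_cat, prev_val = sentinel, None
    distinct = 0                 # distinct values seen so far in this category
    for t in lot:
        cat, _, val = t
        if cat != prev_cat:
            distinct = 1         # first tuple of a new category
        elif val != prev_val:
            distinct += 1        # a new, smaller value within the category
        if distinct <= n:
            res.append(t)
        prev_cat, prev_val = cat, val
    return res
```

Unlike the loop above, this version does not index lot[0], so an empty input simply returns an empty list.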

The idea is this: build a dict whose keys are the values you want to sort by and whose values are the lists of tuples that have that value.

Then sort the dict's items by key, take the items from the top, extract their values, and concatenate them.

Quick, ugly code:

>>> sum(
        map(lambda x: x[1],
            sorted(dict([(x[2], [y for y in lot if y[2] == x[2]])
                for x in lot]).items(),
                reverse=True)[:3]),
    [])

[('a', 'x1', 10),
 ('b', 'x1', 10),
 ('a', 'x2', 9),
 ('a', 'x3', 9),
 ('b', 'x2', 9),
 ('a', 'x4', 8),
 ('a', 'x5', 8),
 ('b', 'x3', 8)]
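
A more readable take on the same dict idea might look like the sketch below. Note that, like the one-liner above, it ranks values across the whole list and ignores category-1. The function name top_values_global is an assumption made for this example.

```python
from collections import defaultdict

def top_values_global(lot, n=3):
    """Return all tuples whose value is among the n largest values overall.

    top_values_global is a hypothetical name; category-1 is ignored here.
    """
    # Map each value to the tuples carrying it (one pass over the list).
    by_value = defaultdict(list)
    for t in lot:
        by_value[t[2]].append(t)

    # Walk the n largest keys and concatenate their tuple lists.
    result = []
    for value in sorted(by_value, reverse=True)[:n]:
        result.extend(by_value[value])
    return result
```

Building the dict in a single pass avoids re-filtering the whole list for every element, which the one-liner does.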

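This is just meant to give you some ideas; I hope it helps. If you need clarification, ask in the comments.

If your input data is already sorted this way, your solution is most likely a bit better than a heapq-based one.


Your algorithm's complexity is O(n), while a heapq-based one is conceptually O(n*log(3)), and it may need more passes over the data to arrange things correctly.

How about this? It does not return exactly the result you want, because it sorts in reverse on y.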

from collections import defaultdict
from heapq import nlargest

# split lot by first element of values
lots = defaultdict(list)
for x, y, z in lot:
    lots[x].append((y, z))

ans = []
for x, l in lots.items():
    # find top-3 unique values
    top = nlargest(3, set(z for (y, z) in l))
    # keep every (y, z) whose z is a top value, ordered by z then y, descending
    ans += [(x, y, z) for (z, y) in sorted([(z, y) for (y, z) in l
                                                   if z in top],
                                           reverse=True)]

print(ans)

Flat list: I used the approach above because it provides more information. To get just a flat list, use a closure to emit the results with onlyTopThreeKeys:

from collections import defaultdict

def topTiedThreeInEachCategory(tuples):
    # category-1 -> value -> set of tuples with that value
    categories = defaultdict(lambda: defaultdict(set))
    for t in tuples:
        cat1,cat2,val = t
        categories[cat1][val].add(t)

    reap = set()

    def sowTopThreeKeys(d):
        keys = sorted(d.keys())[-3:]
        for k in keys:
            for x in d[k]:
                reap.add(x)
    for sets in categories.values():
        sowTopThreeKeys(sets)

    return reap
Result:

{'a': {8: {('a', 'x5', 8), ('a', 'x4', 8)},
       9: {('a', 'x3', 9), ('a', 'x2', 9)},
       10: {('a', 'x1', 10)}},
 'b': {8: {('b', 'x3', 8)}, 
       9: {('b', 'x2', 9)}, 
       10: {('b', 'x1', 10)}}}
>>> topTiedThreeInEachCategory(myTuples)
{('b', 'x2', 9), ('a', 'x1', 10), ('b', 'x3', 8), ('a', 'x2', 9), ('a', 'x4', 8), ('a', 'x3', 9), ('a', 'x5', 8), ('b', 'x1', 10)}


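If your input is guaranteed to be sorted as in the sample input, you could also use itertools.groupby, but your code will break if that ordering ever changes.

I gather from your code snippet that lot is grouped w.r.t. category-1. In that case, the following should work: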

from itertools import groupby, islice
from operator import itemgetter

ans = []
for x, g1 in groupby(lot, itemgetter(0)):
    for y, g2 in islice(groupby(g1, itemgetter(2)), 0, 3):
        ans.extend(list(g2))

print(ans)
# [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), ('a', 'x4', 8), ('a', 'x5', 8),
#  ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8)]
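
Since this approach depends on the input order, a defensive variant could sort first when the order is not guaranteed. The sketch below generalizes the loop to any n; the function name top_n_groupby and the presorted flag are assumptions for illustration, and the sort key assumes numeric values.

```python
from itertools import groupby, islice
from operator import itemgetter

def top_n_groupby(lot, n, presorted=True):
    """groupby/islice approach for arbitrary n, with an optional sort guard.

    top_n_groupby and presorted are hypothetical names, not from the answer.
    """
    if not presorted:
        # Category ascending, value descending; negation assumes numeric values.
        lot = sorted(lot, key=lambda t: (t[0], -t[2]))
    ans = []
    for _, by_cat in groupby(lot, key=itemgetter(0)):
        # Within a category, take only the first n runs of equal values.
        for _, by_val in islice(groupby(by_cat, key=itemgetter(2)), n):
            ans.extend(by_val)
    return ans
```

Sorting costs O(m log m) for m tuples, so it is worth skipping whenever the input really does arrive in the documented order.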

Some additional details... I timed both the itertools version and my own (brute-force) code.

Here are the timeit results for n = 10 and a list of 1 million elements.

# Here's how I built the sample list of 1 million entries.
lot = []
for i in range(1001):
    for j in reversed(range(333)):
        for k in range(3):
            lot.append((i, 'x', j))

# timeit Results for n = 10
brute_force = 6.55s
itertools = 2.07s
# clearly the itertools solution provided by mhyfritz is much faster.
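
A comparison like this could be reproduced roughly as sketched below. The post does not show the actual timing harness, so the helper names (build_lot, itertools_solution, time_call) and the single-run setting are assumptions; absolute numbers will vary by machine and Python version.

```python
import timeit
from itertools import groupby, islice
from operator import itemgetter

def build_lot(categories=1001, values=333, repeats=3):
    """Rebuild the synthetic input: values descend within each category."""
    return [(i, 'x', j)
            for i in range(categories)
            for j in reversed(range(values))
            for _ in range(repeats)]

def itertools_solution(lot, n):
    """The groupby/islice approach, parameterized on n."""
    ans = []
    for _, by_cat in groupby(lot, key=itemgetter(0)):
        for _, by_val in islice(groupby(by_cat, key=itemgetter(2)), n):
            ans.extend(by_val)
    return ans

def time_call(func, lot, n, number=1):
    """Total seconds taken by `number` calls of func(lot, n)."""
    return timeit.timeit(lambda: func(lot, n), number=number)

if __name__ == "__main__":
    lot = build_lot()  # 1001 * 333 * 3 = 999,999 tuples
    print("itertools: %.2fs" % time_call(itertools_solution, lot, 10))
```

Any other candidate with the same (lot, n) signature can be passed to time_call for a side-by-side measurement.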
In case anyone is curious, here is a trace of how mhyfritz's code works:

+ Outer loop - x, g1
| a [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), ('a', 'x4', 8), ('a', 'x5', 8), ('a', 'x6', 7)]
+-- Inner loop - y, g2
  |- 10 [('a', 'x1', 10)]
  |- 9 [('a', 'x2', 9), ('a', 'x3', 9)]
  |- 8 [('a', 'x4', 8), ('a', 'x5', 8)]
+ Outer loop - x, g1
| b [('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8), ('b', 'x4', 7), ('b', 'x5', 6), ('b', 'x6', 5)]
+-- Inner loop - y, g2
  |- 10 [('b', 'x1', 10)]
  |- 9 [('b', 'x2', 9)]
  |- 8 [('b', 'x3', 8)]
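
A trace of this kind can be generated by materializing each group as it is visited. The sketch below uses an invented helper, trace_groups, and a simplified line format rather than the exact layout shown above.

```python
from itertools import groupby, islice
from operator import itemgetter

def trace_groups(lot, n=3):
    """Return one line per outer group and per inner group that is kept.

    trace_groups is a hypothetical helper; the line format is illustrative.
    """
    lines = []
    for cat, by_cat in groupby(lot, key=itemgetter(0)):
        lines.append("category %r" % (cat,))
        for val, by_val in islice(groupby(by_cat, key=itemgetter(2)), n):
            # list() consumes the inner group before groupby advances.
            lines.append("  value %r -> %r" % (val, list(by_val)))
    return lines

for line in trace_groups([('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9)]):
    print(line)
```

Each inner group has to be consumed before the iteration moves on, because groupby shares one underlying iterator across all groups.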

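Wouldn't this fail to pick up both of these entries: ('a','x2',9),('a','x3',9)?

@Toader: That is why I already provided the sample output; you can check for yourself.

It works now because of this: categories[cat1][val].add(t). When I commented, it was categories[cat1][val]=t :)

@Toader: Oh, oops, sorry, I didn't realize your comment was from 20 minutes ago.

@Toader Mihai Claudiu is right. One optimization you could try is to split all the keys into separate lists and exit the loop once you have picked the top three keys from each list. That way you don't have to traverse the whole list. (This assumes you did not spend time sorting in the first place. Without the sort constraint, the heap solution should work best.)

And as a one-liner: list(chain(*(list(g2) for x, g1 in groupby(lot, itemgetter(0)) for y, g2 in islice(groupby(g1, itemgetter(2)), 0, 3))))

The combination of islice and groupby in the second loop is brilliant! Thanks for a great solution!

Yes, that looks pretty :) Major caveat: the code will break in the future if the ordering of the input changes.

@Praveen Gollakota: Glad I could help. Also, thanks for providing the extra details (timeit comparison, trace).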