Python: find how many elements of a list are needed to reach a given coverage [python, list, counter]

I have a long list, x = [4,6,7,8,8,8,9,0,9,1,7,7]. I know I can use a Counter to see how many times each item occurs:

x = [4,6,7,8,8,8,9,0,9,1,7,7]
from collections import Counter
Counter(x)

>>Counter({0: 1, 1: 1, 4: 1, 6: 1, 7: 3, 8: 3, 9: 2})

I can sort them by frequency with:

Counter(x).most_common()

>>[(7, 3), (8, 3), (9, 2), (0, 1), (1, 1), (4, 1), (6, 1)]

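Now I want to know how many elements I need to cover 50% of my list. For example, 7 and 8 together occur 6 times and the list has 12 elements, so 7 and 8 alone are enough to cover 50% of the list. If I add 9, I cover 8 of the 12 elements, so 7, 8 and 9 cover about 67% of the list.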

How can I do this if my list has hundreds of thousands of elements?
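
The arithmetic in the question can be checked with a short sketch. The names `ranked`, `running` and `coverage` are illustrative and do not come from the original post.

```python
from collections import Counter
from itertools import accumulate

x = [4, 6, 7, 8, 8, 8, 9, 0, 9, 1, 7, 7]

# Items ordered from most to least frequent, with their counts.
ranked = Counter(x).most_common()

# Running total of covered list entries after taking the top 1, 2, 3, ... items.
running = list(accumulate(count for _, count in ranked))

# Fraction of the list covered after each step.
coverage = [total / len(x) for total in running]

for (item, _), share in zip(ranked, coverage):
    print(item, round(share, 2))
```

The first two rows reach a coverage of 0.5 and the third reaches roughly 0.67, matching the figures worked out by hand.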

I would simply iterate over the most common items and accumulate their counts until the running total reaches the given percentage of the list length.
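
The code for this answer did not survive on this page, so what follows is only a minimal sketch of the approach it describes, not the answerer's original. The function name `items_for_coverage` and its parameters are assumptions made for illustration.

```python
from collections import Counter


def items_for_coverage(lst, percentage=0.5):
    """Return the most common items needed to cover `percentage` of lst.

    Hypothetical helper: the name and signature are not from the original post.
    """
    target = len(lst) * percentage   # number of list entries to cover
    covered = 0                      # entries covered by the items picked so far
    needed = []
    for item, count in Counter(lst).most_common():
        if covered >= target:
            break                    # target already reached, stop picking
        needed.append(item)
        covered += count
    return needed


x = [4, 6, 7, 8, 8, 8, 9, 0, 9, 1, 7, 7]
print(items_for_coverage(x))        # [7, 8]
print(items_for_coverage(x, 0.6))   # [7, 8, 9]
```

Counting is a single pass over the list and the sort only touches the distinct values, so this should remain practical for lists with hundreds of thousands of elements.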

"if my list has hundreds of thousands of elements"

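You can write a generator function that yields items until the requested share of the list is covered. A generator only produces values on demand and never collects its results in memory, so the output side stays small regardless of the data size (Counter(lst).most_common() still builds the full table of counts first):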

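from collections import Counter

def func(lst, percentage=0.5):
    # Number of list entries that must be covered.
    target = len(lst) * percentage
    cnt = 0
    for x, y in Counter(lst).most_common():
        # Stop once the items yielded so far already reach the target.
        if cnt >= target:
            return
        cnt += y
        yield x

for p in func(x):
    print(p)
# 7
# 8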

If you are willing to use numpy, you can avoid the explicit loop and compute the result with binning, sorting and counting:

import numpy as np

thresh = 0.5

vals, counts = np.unique(x, return_counts=True)
idx = counts.argsort(kind="stable")[::-1]   # most frequent first
vals = vals[idx]
# number of top values needed for the running count to reach the threshold
w = np.argmax(np.cumsum(counts[idx]) >= thresh * len(x)) + 1
print(vals[:w])

# for x = [4,6,7,8,8,8,9,0,9,1,7,7]
# the result is: [8 7]

Performance comparison with @Moses's answer:

# large array
x = np.random.randint(0, 1000, 10000)

# @Moses : 
timeit.timeit("moses()", setup="from __main__ import func, moses", number=1000)
Out[8]: 1.9789454049896449

# @this :
timeit.timeit("f1()", setup="from __main__ import f1", number=1000)
Out[6]: 0.5699292980134487
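
The `moses()` and `f1()` wrappers timed above are not shown on this page. A self-contained harness along the following lines could be used to repeat the comparison; `counter_version` and `numpy_version` are hypothetical stand-ins written for this sketch, and the timings will depend on the machine and library versions.

```python
import timeit
from collections import Counter

import numpy as np


def counter_version(data, percentage=0.5):
    """Counter-based approach (hypothetical stand-in for moses())."""
    target = len(data) * percentage
    covered = 0
    needed = []
    for item, count in Counter(data).most_common():
        if covered >= target:
            break
        needed.append(item)
        covered += count
    return needed


def numpy_version(data, thresh=0.5):
    """numpy-based approach (hypothetical stand-in for f1())."""
    vals, counts = np.unique(data, return_counts=True)
    order = np.argsort(counts, kind="stable")[::-1]   # most frequent first
    running = np.cumsum(counts[order])                # ascending running counts
    # Index of the first running count that reaches the target, plus one.
    w = int(np.searchsorted(running, thresh * len(data), side="left")) + 1
    return vals[order][:w]


# Same shape of test data as in the answer: 10000 integers in [0, 1000).
rng = np.random.default_rng(0)
x = rng.integers(0, 1000, 10000)

for fn in (counter_version, numpy_version):
    seconds = timeit.timeit(lambda: fn(x), number=20)
    print(fn.__name__, round(seconds, 4))
```

Both functions pick values greedily from the most frequent downwards, so they return the same number of values; which of several equally frequent values gets picked may differ between them.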

Do you want to cover 50% of the list with the fewest possible elements?
@JoeIddon Yes, that is what I want, or any other percentage.
Nice use of a generator: +1