Python: partitioning a list of lengths into balanced chunks


python optimization


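I have to write a script in Python. I have a very long list of integers; they are all lengths of a particular measure, and of course there are duplicates. I need to find the best "intervals" so that the resulting chunks are balanced. An example: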

[1,2,2,5,2,3,5,4,5]

Using a Counter and sorting the result gives:

[(1, 1), (2, 3), (3, 1), (4, 1), (5, 3)]

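If I need two buckets, I count the number of elements (9 in this case) and divide that by the number of buckets, which gives about 4, so I need buckets of roughly 4 elements each. In my code I walk through the list of tuples, summing the element counts until the sum reaches at least 4, so: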

(1,1) >= 4? False
(1,1) + (2,3) = 4 >=4? True, break;

So the first interval is 1-2. Then:

(3,1) >=4? False
(3,1)+(4,1) >=4? False
(3,1)+(4,1)+(5,3) >=4? True

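So the second interval is 3-5.

In my dataset I have hundreds of thousands of elements, so this task (counting, sorting, parsing) is very time-consuming.

Is there a way to speed this up?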
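For reference, the procedure described in the question can be sketched roughly as follows. This is only an illustration of the steps as described, not the asker's actual code; the function name `greedy_intervals` and the use of integer division for the target size are assumptions.

```python
from collections import Counter

def greedy_intervals(lengths, n_buckets):
    """Hypothetical sketch: close an interval once it holds >= target items."""
    # Count duplicates, then sort by value: [(value, count), ...]
    counts = sorted(Counter(lengths).items())
    # Assumed target size: total elements divided by the bucket count
    target = len(lengths) // n_buckets
    intervals = []
    start = None   # first value of the interval being built
    running = 0    # number of elements collected so far
    for value, count in counts:
        if start is None:
            start = value
        running += count
        if running >= target:
            intervals.append((start, value))
            start, running = None, 0
    # Any leftover values form a final, smaller interval
    if start is not None:
        intervals.append((start, value))
    return intervals

# A list consistent with the counts in the example above
example = [1, 2, 2, 5, 2, 3, 5, 4, 5]
print(greedy_intervals(example, 2))  # [(1, 2), (3, 5)]
```

Counting is linear in the number of elements, and the sort only touches the distinct values, so on a list with many duplicates the counting step is likely to dominate the running time.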

Here is a way to create contiguous buckets of roughly equal size. It makes heavy use of the standard library, using collections.Counter, heapq.merge, itertools.accumulate and itertools.groupby:

from itertools import groupby, accumulate
from heapq import merge
from collections import Counter
from math import sin, pi
import random

# make test data a bit uneven
def mock_data(N):
    return [int(sin(2*pi*random.random())*50 + 50) for _ in range(N)]

N = 1000000

data = mock_data(N)

counts = Counter(data)
srtcnts = sorted(counts.items())

k = 7 # number of buckets

slabels, scounts = zip(*srtcnts)
# compute cumulative bin centers
bincntrs = (a - c/2 for a, c in zip(accumulate(scounts), scounts))
# mix in the optimal boundaries
split = merge(zip(bincntrs, slabels), zip(range(0, N, -(-N//k))))
# group into boundaries and stuff between boundaries;
# keep only the stuff between
# (the loop variable is named n so it does not shadow the bucket count k)
res = [[v[1] for v in grp] for n, grp in groupby(split, len) if n == 2]

print(res)
# show they are balanced
print([sum(counts[i] for i in chunk) for chunk in res])

Sample output:

[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60], [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80], [81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94], [95, 96, 97, 98, 99]]
[143297, 143387, 142010, 141358, 143224, 143617, 143107]
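To see how the merge/groupby step behaves, the same idea can be run on a small list matching the counts in the question. The sketch below wraps the approach in a function; the name `balanced_chunks` is hypothetical, and the code assumes a non-empty input list.

```python
from collections import Counter
from heapq import merge
from itertools import accumulate, groupby

def balanced_chunks(data, n_buckets):
    """Hypothetical wrapper around the merge/groupby approach shown above."""
    counts = Counter(data)
    labels, sizes = zip(*sorted(counts.items()))
    total = len(data)
    # Midpoint of each value's run of duplicates in the sorted data
    centers = (end - size / 2 for end, size in zip(accumulate(sizes), sizes))
    step = -(-total // n_buckets)            # ceiling division
    boundaries = zip(range(0, total, step))  # 1-tuples: (0,), (step,), ...
    # Both streams are sorted, so merge interleaves boundaries and centers
    merged = merge(zip(centers, labels), boundaries)
    # 2-tuples are values, 1-tuples are boundaries; keep runs of values
    return [[label for _, label in grp]
            for size, grp in groupby(merged, len) if size == 2]

small = [1, 2, 2, 5, 2, 3, 5, 4, 5]
chunks = balanced_chunks(small, 2)
print(chunks)                                       # [[1, 2, 3], [4, 5]]
print([sum(small.count(v) for v in c) for c in chunks])  # [5, 4]
```

On this input the split differs from the greedy one in the question (1-3 and 4-5 instead of 1-2 and 3-5), because each value is assigned according to the midpoint of its run rather than its running total; both splits have sizes 4 and 5.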

Pretty sure itertools has something bucket-like built in.

Please also share what your final list should look like.