Python: grouping consecutive values in a numpy array with their lengths


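What is a good way, in numpy/scipy (or pure Python if you prefer), to group the contiguous regions in a numpy array and count the length of each region?

Something like this: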

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
y = contiguousGroup(x)
print(y)

>> [[1,3], [2,2], [3,1], [0,5], [1,1], [2,1], [3,1], [1,2], [0,3]]
I tried doing this with a loop, but for a list of about 30 million samples and 20,000 contiguous regions it takes longer than I would like (6 seconds).
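The loop itself is not shown in the question. As a rough sketch only, a plain Python loop for this task might look like the following. The name `contiguous_group_loop` is hypothetical, and the `int()` conversion assumes integer data.

```python
import numpy as np

def contiguous_group_loop(x):
    """Return [[value, run_length], ...] using a plain Python loop.

    Hypothetical baseline, not the asker's actual code.
    """
    groups = []
    if len(x) == 0:
        return groups
    current, length = x[0], 1
    for value in x[1:]:
        if value == current:
            # Same value as before: the current run gets longer.
            length += 1
        else:
            # Value changed: store the finished run and start a new one.
            groups.append([int(current), length])
            current, length = value, 1
    # Store the final run, which the loop never closes.
    groups.append([int(current), length])
    return groups

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
loop_result = contiguous_group_loop(x)
print(loop_result)
```

Every element is visited in the interpreter, which is why an approach like this scales poorly to tens of millions of samples.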

Edit:

Now for some speed comparisons (just using time.clock() and a few hundred iterations, or fewer iterations when a run takes seconds).
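Note that time.clock() was removed in Python 3.8. A minimal timing sketch using time.perf_counter() instead is shown below. The helper name `time_call`, the test data, and the choice of reporting the best of several runs are all assumptions for illustration, not the setup used for the numbers in this post.

```python
import time
import numpy as np

def time_call(func, arg, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical test data: 1000 random values in 0..3, each repeated 50 times.
rng = np.random.default_rng(0)
sample = np.repeat(rng.integers(0, 4, size=1000), 50)

# np.diff stands in here for whichever grouping function is being timed.
elapsed = time_call(np.diff, sample)
print("Number of elements ", sample.size)
print("Time taken = %f seconds..." % elapsed)
```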

First, my Python loop code, tested on 5 samples:

Number of elements  33718251
Number of regions  135137
Time taken = 8.644007 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 10.533305 seconds...

Number of elements  21841302
Number of regions  7619335
Time taken = 7.671015 seconds...

Number of elements  19723928
Number of regions  10799
Time taken = 5.014807 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 4.207359 seconds...
Now Divakar's vectorized solution:

Number of elements  33718251
Number of regions  135137
Time taken = 0.063470 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 0.046293 seconds...

Number of elements  21841302
Number of regions  7619335
Time taken = 1.654288 seconds...

Number of elements  19723928
Number of regions  10799
Time taken = 0.022651 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 0.021189 seconds...
The modified method gives roughly the same times (perhaps 5% slower in the worst case).

Now the generator from Kasramvd:

Number of elements  33718251
Number of regions  135137
Time taken = 3.834922 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 4.785480 seconds...

Number of elements  21841302
Number of regions  7619335
Time taken = 6.806867 seconds...

Number of elements  19723928
Number of regions  10799
Time taken = 2.264413 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 1.778873 seconds...
Now his numpythonic version:

Number of elements  33718251
Number of regions  135137
Time taken = 0.286336 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 0.174769 seconds...

Memory error sample 3 (too many regions)

Number of elements  19723928
Number of regions  10799
Time taken = 0.087028 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 0.084963 seconds...

Anyway, I think the moral of the story is that numpy is very good.

Here's a vectorized approach -

idx = np.concatenate(([0],np.flatnonzero(x[:-1]!=x[1:])+1,[x.size]))
out = list(zip(x[idx[:-1]],np.diff(idx)))  # list() is needed on Python 3, where zip returns an iterator
Sample run -

In [34]: x
Out[34]: array([1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0])

In [35]: out
Out[35]: [(1, 3), (2, 2), (3, 1), (0, 5), (1, 1), (2, 1), (3, 1), (1, 2), (0, 3)]
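To make the steps in the two lines above easier to follow, here is the same idea wrapped in a function with comments. This is only a sketch: the name `run_lengths` and the empty-input guard are additions for illustration and are not part of the answer.

```python
import numpy as np

def run_lengths(x):
    """Return (values, lengths) for each run of equal consecutive elements."""
    x = np.asarray(x)
    if x.size == 0:
        # Guard added for this sketch: no runs in an empty array.
        return np.empty(0, dtype=x.dtype), np.empty(0, dtype=np.intp)
    # x[:-1] != x[1:] is True wherever an element differs from its successor;
    # adding 1 turns those positions into the start index of the next run.
    change = np.flatnonzero(x[:-1] != x[1:]) + 1
    # Prepend 0 (start of the first run) and append x.size (end of the last).
    idx = np.concatenate(([0], change, [x.size]))
    # Each run's value sits at its start index; its length is the gap
    # between consecutive boundaries.
    return x[idx[:-1]], np.diff(idx)

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
values, lengths = run_lengths(x)
pairs = list(zip(values.tolist(), lengths.tolist()))
print(pairs)
```

Returning two arrays rather than a zipped list keeps the result in NumPy, so the pairing only has to be built when it is actually needed.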

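That concatenation on the entire array could be expensive. So a modified version that does the concatenation on the group-shift indices instead could be suggested, like so -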

idx0 = np.flatnonzero(x[:-1]!=x[1:])
count = np.concatenate(([idx0[0]+1],np.diff(idx0),[x.size-idx0[-1]-1]))
out = list(zip(x[np.append(0,idx0+1)],count))  # list() is needed on Python 3
Alternatively, at the final step, if output as a 2D array is acceptable, we could avoid the zipping and use NumPy's column_stack, like so -

out = np.column_stack((x[np.append(0,idx0+1)],count))
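One thing to watch with the modified version: when the array contains a single run (or is empty), `idx0` is empty and `idx0[0]` raises an IndexError. A possible guarded sketch of the column_stack variant follows. The function name `run_length_table` and both guards are assumptions added here, not part of the answer.

```python
import numpy as np

def run_length_table(x):
    """Return a 2D array with one [value, run_length] row per run."""
    x = np.asarray(x)
    if x.size == 0:
        # Guard (assumption): an empty input yields an empty table.
        return np.empty((0, 2), dtype=x.dtype)
    # Index of the last element of every run except the final one.
    idx0 = np.flatnonzero(x[:-1] != x[1:])
    if idx0.size == 0:
        # Guard (assumption): the whole array is one run.
        return np.array([[x[0], x.size]])
    # First run length, lengths of the middle runs, last run length.
    count = np.concatenate(([idx0[0] + 1], np.diff(idx0), [x.size - idx0[-1] - 1]))
    # Run start indices are 0 followed by the position after each change.
    starts = np.append(0, idx0 + 1)
    return np.column_stack((x[starts], count))

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
table = run_length_table(x)
print(table)
```

Because column_stack produces a single array, a float input would cause the counts to be stored as floats as well; returning the values and counts separately avoids that.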

Here is a numpythonic-pythonic approach:

In [192]: [(i[0], len(i)) for i in np.split(x, np.where(np.diff(x) != 0)[0]+1)]
Out[192]: [(1, 3), (2, 2), (3, 1), (0, 5), (1, 1), (2, 1), (3, 1), (1, 2), (0, 3)]
And here is a generator-based approach using itertools.groupby():
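The answerer's code for this did not come through here. As a minimal sketch of what a groupby-based generator could look like (the name `iter_runs` is hypothetical, and this is not necessarily the answerer's implementation):

```python
from itertools import groupby

import numpy as np

def iter_runs(x):
    """Yield (value, run_length) lazily for each run of equal elements."""
    # groupby() groups consecutive equal items, so each group is one run.
    for value, run in groupby(x):
        # Count the items in the run without building a list.
        yield value, sum(1 for _ in run)

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
# Convert NumPy scalars to plain ints for display.
runs = [(int(value), length) for value, length in iter_runs(x)]
print(runs)
```

A generator like this needs very little extra memory, but it still walks the array element by element in Python.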

Or:


All you may need is np.diff, which is also easier to read. Create a mask:

x    = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
mask = np.where( np.diff(x) != 0)[0]
mask = np.hstack((-1, mask, len(x)-1 ))

list(zip( x[mask[1:]], np.diff(mask) ))  # list() is needed on Python 3

This should be the easiest to understand, and it is fully vectorized (not sure about zip)...
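One way to sanity-check this (or any of the other approaches) is to expand the result back with np.repeat and compare it to the input. In the sketch below the variable names `values`, `lengths`, and `rebuilt` are introduced only for illustration.

```python
import numpy as np

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])

# mask holds the index of the last element of every run,
# with -1 prepended so that np.diff gives the first run's length too.
mask = np.where(np.diff(x) != 0)[0]
mask = np.hstack((-1, mask, len(x) - 1))

values = x[mask[1:]]     # value at the end of each run
lengths = np.diff(mask)  # distance between consecutive run ends

# Repeating each value by its run length should reproduce the input.
rebuilt = np.repeat(values, lengths)
print(np.array_equal(rebuilt, x))
```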

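First, you are missing a closing parenthesis on line 1. Could you tell us what speedup (if any) you are getting with the proposed solutions?
Sure, I'll compare yours against mine and the other solutions.
Nice! Thanks for the edit!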