Python: finding start/stop index ranges of values in a NumPy array that are greater than N


python numpy


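Suppose I have a NumPy array:

x = np.array([2, 3, 4, 0, 0, 1, 1, 4, 6, 5, 8, 9, 9, 4, 2, 0, 3])

For all values with x >= 2, I need to find the start/stop indices of the runs of consecutive values where x >= 2 (that is, a run consisting of a single value greater than or equal to 2 is not counted). I then repeat this for x >= 3, x >= 4, ..., x >= x.max(). The output should be a three-column NumPy array (the first column is the minimum value, the second column is the inclusive start index, and the third column is the stop index), like this: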
[[2,  0,  2],
 [2,  7, 14],
 [3,  1,  2],
 [3,  7, 13],
 [4,  7, 13],
 [5,  8, 12],
 [6, 10, 12],
 [8, 10, 12],
 [9, 11, 12]
]

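Naively, I could look at each unique value and then search for the start/stop indices. However, this requires multiple passes over x. What is the best vectorized NumPy way to accomplish this? Is there a solution that does not require multiple passes over the data?

Update
I realized that I also need to count single instances. So my output should be: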
[[2,  0,  2],
 [2,  7, 14],
 [2, 16, 16],  # New line needed
 [3,  1,  2],
 [3,  7, 13],
 [3, 16, 16],  # New line needed
 [4,  2,  2],  # New line needed
 [4,  7, 13],
 [5,  8, 12],
 [6,  8,  8],  # New line needed
 [6, 10, 12],
 [8, 10, 12],
 [9, 11, 12]
]
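
For reference, the naive multi-pass approach mentioned in the question could be sketched as follows. This is only a baseline under stated assumptions: the helper names runs_at_threshold and naive_ranges, and the parameters min_value and keep_singles, are hypothetical and do not come from the original post.

```python
import numpy as np

def runs_at_threshold(x, t):
    """Return inclusive [start, stop] index pairs of the runs where x >= t."""
    mask = x >= t
    # Pad with False on both sides so runs touching either end are detected.
    padded = np.concatenate(([False], mask, [False]))
    change = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(change == 1)       # mask goes 0 -> 1
    stops = np.flatnonzero(change == -1) - 1   # mask goes 1 -> 0, made inclusive
    return np.column_stack([starts, stops])

def naive_ranges(x, min_value=2, keep_singles=True):
    rows = []
    for t in np.unique(x[x >= min_value]):     # one full pass over x per threshold
        r = runs_at_threshold(x, t)
        if not keep_singles:
            r = r[r[:, 0] != r[:, 1]]          # drop runs of length one
        rows.append(np.column_stack([np.full(len(r), t), r]))
    return np.concatenate(rows)

x = np.array([2, 3, 4, 0, 0, 1, 1, 4, 6, 5, 8, 9, 9, 4, 2, 0, 3])
print(naive_ranges(x))
```

With keep_singles=True this reproduces the updated table above; with keep_singles=False it reproduces the first one. It scans x once per unique value, which is exactly the cost the question hopes to avoid.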

This is a really interesting problem. I tried to solve it by breaking it into three parts.
Grouping:
import numpy as np
import pandas as pd
x = np.array([2, 3, 4, 0, 0, 1, 1, 4, 6, 5, 8, 9, 9, 4, 2, 0, 3])
groups = pd.DataFrame(x).groupby([0]).indices

So groups is the dictionary {0: [3, 4, 15], 1: [5, 6], 2: [0, 14], 3: [1, 16], 4: [2, 7, 13], 5: [9], 6: [8], 8: [10], 9: [11, 12]}, whose values are arrays of dtype=int64.
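
If pandas is not available, the same value-to-indices mapping can probably be built with NumPy alone. The sketch below is an assumption-laden alternative, not part of the original answer; the name groups_np is hypothetical.

```python
import numpy as np

x = np.array([2, 3, 4, 0, 0, 1, 1, 4, 6, 5, 8, 9, 9, 4, 2, 0, 3])

# A stable sort keeps the indices of equal values in ascending order.
order = np.argsort(x, kind="stable")
# first[i] is the position in the sorted array where the i-th unique value begins.
values, first = np.unique(x[order], return_index=True)
# Split the sorted index array at the start of each new value.
groups_np = dict(zip(values.tolist(), np.split(order, first[1:])))

print(groups_np[4])
```

The keys come out in ascending order, so list(groups_np)[::-1] visits the values from largest to smallest, just as the masking step requires.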
Masking:

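In this part, I iterate over the unique values i in descending order and build up the mask array x >= i for each of them:
mask_array = np.zeros(x.size, dtype=int)
for group in list(groups)[::-1]:
    # Switch on the positions holding this value; mask_array now equals x >= group.
    mask_array[groups[group]] = 1
    # print(group, ':', mask_array)
    # output = find_slices(mask_array)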

The masks look like this:
9 : [0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0]
8 : [0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0]
6 : [0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0]
5 : [0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0]
4 : [0 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0]
3 : [0 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1]
2 : [1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1]
1 : [1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1]
0 : [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
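
The key property of this step is that, after the indices of value v have been switched on, the cumulative mask equals x >= v. A small check of that invariant, written as a sketch (the masks dictionary is a hypothetical name, and x == v is used here only for brevity instead of the precomputed groups):

```python
import numpy as np

x = np.array([2, 3, 4, 0, 0, 1, 1, 4, 6, 5, 8, 9, 9, 4, 2, 0, 3])

mask = np.zeros(x.size, dtype=bool)
masks = {}
for v in np.unique(x)[::-1]:                 # unique values, largest first
    mask[x == v] = True                      # switch on positions holding exactly v
    assert np.array_equal(mask, x >= v)      # invariant: cumulative mask == (x >= v)
    masks[int(v)] = mask.copy()

print(masks[4].astype(int))
```

The printed row should match the line for 4 in the listing above.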

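Extracting slices from the masks:
I wanted a function called find_slices that extracts the slice positions from a mask array (it is called in the loop above once the last line is uncommented). This is what I came up with: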
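def find_slices(m):
    # Pad with a leading/trailing 0 so that runs touching either end are detected.
    m1 = np.r_[0, m]
    m2 = np.r_[m, 0]
    # A run starts where the mask goes 0 -> 1 and ends where it goes 1 -> 0.
    starts, = np.where(~m1 & m2)
    ends, = np.where(m1 & ~m2)
    # Return inclusive [start, end] pairs.
    return np.c_[starts, ends - 1]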

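For example, the slice positions of the array [0 1 1 0 ... 1 0 0 1] (the mask for 3 above) would be [[1, 2], [7, 13], [16, 16]]. Note that this is not the standard way of returning slices; the end position is usually one larger.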
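
To illustrate the remark about the usual convention, here is a sketch of a half-open variant whose pairs can be fed directly to Python slicing. The function name find_slices_half_open is hypothetical and is not part of the original answer.

```python
import numpy as np

def find_slices_half_open(m):
    """Return [start, stop) pairs for each run of True values in mask m."""
    m = np.asarray(m, dtype=bool)
    padded = np.concatenate(([False], m, [False]))
    # Every run contributes exactly two change points: its start and its
    # one-past-the-end position, so the edges pair up row by row.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return edges.reshape(-1, 2)

mask = np.array([0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1], dtype=bool)
pairs = find_slices_half_open(mask)
runs = [mask[start:stop] for start, stop in pairs]   # each run is all True
print(pairs)
```

For this mask the pairs are [[1, 3], [7, 14], [16, 17]]; subtracting 1 from the second column gives the inclusive form that find_slices returns.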
Final script
Finally, the pieces have to be combined to produce the expected output. This is how it looks in the end:
import numpy as np
import pandas as pd
x = np.array([2, 3, 4, 0, 0, 1, 1, 4, 6, 5, 8, 9, 9, 4, 2, 0, 3])
groups = pd.DataFrame(x).groupby([0]).indices
mask_array = np.zeros(x.size).astype(bool)

m = []
for group in list(groups)[::-1]:
    mask_array[groups[group]] = True
    s = find_slices(mask_array)
    group_output = np.c_[np.repeat(group, s.shape[0]), s]  # insert first column
    m.append(group_output)
output = np.concatenate(m[::-1])
output = output[output[:, 1] != output[:, 2]]  # eliminate slices of unit length
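
Two details of the script are worth noting: the loop visits every unique value, so rows for the thresholds 0 and 1 are produced as well, and the last line drops single-element runs, which the updated question wants to keep. A self-contained sketch that exposes both choices is given below. It is pandas-free and rests on assumptions: threshold_ranges, min_value and keep_singles are hypothetical names, and the grouping is done with a stable argsort rather than groupby.

```python
import numpy as np

def find_slices(m):
    m1 = np.r_[False, m]
    m2 = np.r_[m, False]
    starts, = np.where(~m1 & m2)
    ends, = np.where(m1 & ~m2)
    return np.c_[starts, ends - 1]           # inclusive [start, end] pairs

def threshold_ranges(x, min_value=2, keep_singles=True):
    x = np.asarray(x)
    order = np.argsort(x, kind="stable")
    values, first = np.unique(x[order], return_index=True)
    index_groups = np.split(order, first[1:])  # indices of each unique value

    mask = np.zeros(x.size, dtype=bool)
    blocks = []
    # Descending order: after adding the indices of v, mask == (x >= v).
    for v, idx in zip(values[::-1], index_groups[::-1]):
        if v < min_value:
            break                              # skip thresholds below min_value
        mask[idx] = True
        s = find_slices(mask)
        blocks.append(np.c_[np.full(len(s), v), s])
    out = np.concatenate(blocks[::-1])         # assumes at least one value >= min_value
    if not keep_singles:
        out = out[out[:, 1] != out[:, 2]]
    return out

x = np.array([2, 3, 4, 0, 0, 1, 1, 4, 6, 5, 8, 9, 9, 4, 2, 0, 3])
print(threshold_ranges(x))
```

Each element of x is switched on in the mask only once, but find_slices still scans the whole mask for every threshold, so this is not a true single-pass solution.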

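Output:

np.where(x > 2) might be a start.... And tests = np.arange(2, x.max() + 1); q = np.greater(x, tests[:, None]); np.argwhere(q)

Shouldn't [7, 10, 12] also be in the result array, since it would come from x >= 7?

@andreak I see that you have provided solutions for both cases. In fact, I might need both. Thank you very much. Also, in e = as_strided(d, shape=(len(d)-1, 2), strides=(8, 8)), what is the relevance of strides=(8, 8)? What does it depend on? In reality, my input array is much larger and has many more values. I guess the current solution may not be in the most general form for accepting a different array?

You can replace (8, 8) with d.strides*2, but since d is the result of np.where, which returns an array of dtype int64, d.strides is 8 (bytes) anyway.

@slaw I have edited my answer so that it can also count single instances (just comment out one line).

Is there a way to avoid/replace b = (x >= a[:, None])? For a large array, this dense matrix consumes a lot of memory.

@slaw Maybe you can split a (the array of unique values of x bigger than 1) into sub-arrays and iterate over them. For example: f = []; for i in range(0, len(a), 100): ai = a[i:i+100]; b = (x >= ai[:, None]). Apart from that and f.append(np.hstack([c[:, 0][e[:, 0, None]], c[:, 1][e]])), the rest of the code is the same. At the end you can do f = np.concatenate(f).

Here is another solution (I believe it can be improved):
Sorry for not commenting on what each step does; if I can find time later, I will fix that.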
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.array([2, 3, 4, 0, 0, 1, 1, 4, 6, 5, 8, 9, 9, 4, 2, 0, 3])

# array of unique values of x bigger than 1
a = np.unique(x[x>=2])

step = len(a)  # if you encounter memory problems, try a smaller step
result = []
for i in range(0, len(a), step):
    ai = a[i:i + step]
    c = np.argwhere(x >= ai[:, None])
    c[:,0] = ai[c[:,0]]
    c =  np.pad(c, ((1,1), (0,0)), 'symmetric')

    d = np.where(np.diff(c[:,1]) !=1)[0]

    e = as_strided(d, shape=(len(d)-1, 2), strides=d.strides*2).copy()
    # e = e[(np.diff(e, axis=1) > 1).flatten()]
    e[:,0] = e[:,0] + 1 

    result.append(np.hstack([c[:,0][e[:,0, None]], c[:,1][e]]))

result = np.concatenate(result)

# array([[ 2,  0,  2],
#        [ 2,  7, 14],
#        [ 2, 16, 16],
#        [ 3,  1,  2],
#        [ 3,  7, 13],
#        [ 3, 16, 16],
#        [ 4,  2,  2],
#        [ 4,  7, 13],
#        [ 5,  8, 12],
#        [ 6,  8,  8],
#        [ 6, 10, 12],
#        [ 8, 10, 12],
#        [ 9, 11, 12]])
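
On the strides question raised in the comments: as_strided is used here only to view d as overlapping pairs (d[i], d[i+1]), and the stride is simply the item size of d in bytes, which is why d.strides*2 is more portable than a hard-coded (8, 8). A sketch comparing it with plain slicing, using a hypothetical boundary array d:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

d = np.array([0, 3, 10, 11])   # hypothetical boundary positions, for illustration

# Overlapping pairs built as a strided view, as in the answer. d.strides is a
# one-element tuple holding the item size in bytes, so d.strides * 2 repeats it
# for both axes: (8, 8) for int64, (4, 4) for int32.
pairs_strided = as_strided(d, shape=(len(d) - 1, 2), strides=d.strides * 2).copy()

# The same pairs without any manual stride arithmetic.
pairs_stacked = np.column_stack([d[:-1], d[1:]])

print(np.array_equal(pairs_strided, pairs_stacked))
```

The slicing version costs one extra copy but cannot read out-of-bounds memory if the shape is computed incorrectly, which is a known hazard of as_strided.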