Python: Searching for a sequence in a NumPy array

python numpy search

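Suppose I have the following array:

 array([2, 0, 0, 1, 0, 1, 0, 0])

How do I get the indices at which a sequence of values such as [0,0] occurs? The expected output for this case would be: [1,2,6,7]

Edit:

1) Please note that [0,0] is just one sequence. It could be [0,0,0] or [4,6,8,9] or [5,2,0], anything at all.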

2) If my array were modified to array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0]), the expected result for the same sequence [0,0] would be [1,2,3,4,8,9].

I am looking for some kind of NumPy shortcut.

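Well, this is basically a template-matching problem that comes up a lot in image processing. Listed in this post are two approaches: a pure NumPy one and an OpenCV (cv2) one.

Approach #1: With NumPy, one can create a 2D array of sliding indices across the entire length of the input array, so that each row is a sliding window of elements. Next, compare each row against the input sequence, which gives a vectorized solution through broadcasting. The rows that are all True are the exact matches, and their row numbers are the starting indices of the matches. Finally, using those starting indices, create ranges of indices extending over the length of the sequence to get the desired output. The implementation would be: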

import numpy as np

def search_sequence_numpy(arr,seq):
    """ Find sequence in an array using NumPy only.

    Parameters
    ----------    
    arr    : input 1D array
    seq    : input 1D array

    Output
    ------    
    Output : 1D Array of indices in the input array that satisfy the 
    matching of input sequence in the input array.
    In case of no match, an empty list is returned.
    """

    # Store sizes of input array and sequence
    Na, Nseq = arr.size, seq.size

    # Range of sequence
    r_seq = np.arange(Nseq)

    # Create a 2D array of sliding indices across the entire length of input array.
    # Match up with the input sequence & get the matching starting indices.
    M = (arr[np.arange(Na-Nseq+1)[:,None] + r_seq] == seq).all(1)

    # Get the range of those indices as final output
    if M.any():
        return np.where(np.convolve(M,np.ones((Nseq),dtype=int))>0)[0]
    else:
        return []         # No match found
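
To make the steps above concrete, here is a small walk-through on the array from the question. It is only a sketch of the intermediate values; the names `idx`, `windows`, `starts` and `covered` are illustrative and not part of the function above, and the final expansion uses `np.unique` rather than the convolution.

```python
import numpy as np

arr = np.array([2, 0, 0, 1, 0, 1, 0, 0])
seq = np.array([0, 0])

# Sliding indices: row i holds the positions i, i+1, ..., i+len(seq)-1
idx = np.arange(arr.size - seq.size + 1)[:, None] + np.arange(seq.size)
windows = arr[idx]                    # shape (7, 2), one window per row

# A row that equals seq everywhere marks a match starting at that row index
M = (windows == seq).all(axis=1)
starts = np.flatnonzero(M)            # start positions of the matches

# Expand each start into the full run of covered indices, then deduplicate
covered = np.unique(starts[:, None] + np.arange(seq.size))
print(starts)   # [1 6]
print(covered)  # [1 2 6 7]
```

Overlapping matches are handled by the deduplication step, since a position covered by two windows is reported only once.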

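Approach #2: With OpenCV (cv2), there is a built-in function for template matching, cv2.matchTemplate. It gives us the starting indices of the matches; the remaining steps are the same as in the previous approach. Here is the implementation with cv2:

import cv2
import numpy as np
from cv2 import matchTemplate as cv2m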

def search_sequence_cv2(arr,seq):
    """ Find sequence in an array using cv2.
    """

    # Run a template match with input sequence as the template across
    # the entire length of the input array and get scores.
    S = cv2m(arr.astype('uint8'),seq.astype('uint8'),cv2.TM_SQDIFF)

    # Now, with floating point array cases, the matching scores might not be 
    # exactly zeros, but would be very small numbers as compared to others.
    # So, use a very small value to threshold the scores against
    # and decide on matches.
    thresh = 1e-5 # Would depend on elements in seq. So, be careful setting this.

    # Find the matching indices
    idx = np.where(S.ravel() < thresh)[0]

    # Get the range of those indices as final output
    if len(idx)>0:
        return np.unique((idx[:,None] + np.arange(seq.size)).ravel())
    else:
        return []         # No match found
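
For readers without OpenCV at hand, the score that the TM_SQDIFF mode computes for each window is the sum of squared differences between the template and that window, so an exact match scores zero. The sketch below reproduces that score in plain NumPy; `sqdiff_scores` is a hypothetical helper written for illustration, not an OpenCV function. It works on floats directly, whereas the cast to uint8 in the function above would presumably not compare values outside 0 to 255, or non-integer values, faithfully.

```python
import numpy as np

def sqdiff_scores(arr, seq):
    """Sum of squared differences between seq and every window of arr."""
    arr = np.asarray(arr, dtype=float)
    seq = np.asarray(seq, dtype=float)
    # Same sliding-index construction as in the NumPy approach
    idx = np.arange(arr.size - seq.size + 1)[:, None] + np.arange(seq.size)
    return ((arr[idx] - seq) ** 2).sum(axis=1)

arr = np.array([2, 0, 0, 1, 0, 1, 0, 0])
seq = np.array([0, 0])

scores = sqdiff_scores(arr, seq)
starts = np.flatnonzero(scores < 1e-5)   # same threshold idea as above
print(scores)  # [4. 0. 1. 1. 1. 1. 0.]
print(starts)  # [1 6]
```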

Runtime test
In [477]: arr = np.random.randint(0,9,(100000))
     ...: seq = np.array([3,6,8,4])
     ...: 

In [478]: np.allclose(search_sequence_numpy(arr,seq),search_sequence_cv2(arr,seq))
Out[478]: True

In [479]: %timeit search_sequence_numpy(arr,seq)
100 loops, best of 3: 11.8 ms per loop

In [480]: %timeit search_sequence_cv2(arr,seq)
10 loops, best of 3: 20.6 ms per loop

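Looks like the pure NumPy version is the safest and, in this test, also the fastest (11.8 ms vs. 20.6 ms).

I find that the most concise, intuitive, and general way to do this is with regular expressions: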
import re
import numpy as np

# Set the threshold for string printing to infinite
np.set_printoptions(threshold=np.inf)

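# Remove spaces and linebreaks that would come through when printing your vector.
# yourarray is your 1D array; string positions equal array indices only
# when every element prints as a single character (digits 0-9).
yourarray_string = re.sub(r'\n|\s','',np.array_str( yourarray ))[1:-1]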

# The next line is the most important, set the arguments in the braces
# such that the first argument is the shortest sequence you want
# and the second argument is the longest (using empty as infinite length)

r = re.compile(r"[0]{1,}") 
zero_starts = [m.start() for m in r.finditer( yourarray_string )]
zero_ends = [m.end() for m in r.finditer( yourarray_string )]
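
The snippet above finds runs of zeros; a variant that searches for an arbitrary sequence and returns the covered indices could look like the sketch below. It is not the answer's own code: `search_sequence_regex` is a hypothetical name, and the sketch assumes every value is an integer in 0..255, so that each element maps to exactly one byte and byte offsets equal array indices. A zero-width lookahead is used so that overlapping occurrences are reported as well.

```python
import re
import numpy as np

def search_sequence_regex(arr, seq):
    """Indices covered by (possibly overlapping) occurrences of seq in arr.

    Assumption: all values are integers in 0..255, so one element == one byte.
    """
    arr = np.asarray(arr)
    seq = np.asarray(seq)
    data = arr.astype(np.uint8).tobytes()
    pattern = seq.astype(np.uint8).tobytes()
    # The lookahead matches without consuming input, so finditer also
    # reports matches that overlap each other
    regex = re.compile(b"(?=" + re.escape(pattern) + b")")
    starts = np.array([m.start() for m in regex.finditer(data)], dtype=int)
    # Expand every start into the indices it covers and deduplicate
    return np.unique(starts[:, None] + np.arange(seq.size))

print(search_sequence_regex([2, 0, 0, 1, 0, 1, 0, 0], [0, 0]))  # [1 2 6 7]
print(search_sequence_regex([0, 0, 0], [0, 0]))                 # [0 1 2]
```

When nothing matches, this sketch returns an empty array rather than an empty list.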

What about array([2,0,0,1,0,1,0,1,0])? If I understand your question correctly, you want a general method that works for any sequence, and [0,0] is just an example?