Multiple substring search in a numpy array with few unique values

python-3.x performance numpy substring

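Motivated by another question: suppose I have a list xs of multiple 1D numpy arrays, and I want to know how many of them occur as "substrings" of another, larger 1D numpy array y.

We can assume that the arrays contain integers, and that a is a substring of b if a == b[p:q] for some integers p and q.

My proposed solution uses the in operator of Python's bytes objects, but I suspect it is inefficient when xs has many elements: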

import numpy as np

N = 10_000    # number of arrays to search
M = 3         # "alphabet" size 
K = 1_000_000 # size of the target array

xs = [np.random.randint(0, M, size=7) for _ in range(N)]
y = np.random.randint(0, M, size=K)

y_bytes = y.tobytes()
%time num_matches = sum(1 for x in xs if x.tobytes() in y_bytes)
# CPU times: user 1.03 s, sys: 17 µs, total: 1.03 s
# Wall time: 1.03 s

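If M is large (that is, the number of possible values in any x in xs is large), then I imagine there is little that can be done to speed this up. For small M, however, I imagine that a trie or something similar could help. Is there an efficient way to implement this in Python, possibly using numpy or numba?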

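For small M, we can assign each array in xs a unique tag based on the combination of integers in it. Concretely, an array of length L with values in [0, M) is read as the digits of a base-M integer, so distinct arrays get distinct tags as long as M**L fits in the integer type. To compute the tags, we can use convolution with a scaling array, which reduces each array in xs to a single scalar. Finally, we use a matching method to detect the tags that occur in y, and hence count them.

The only overhead is the conversion from a list of arrays to an array. So, if the data is created as an array in the first place rather than as a list, the final performance numbers improve considerably.

The implementation would look something like this: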

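x = np.asarray(xs) # convert to a 2D array, if not already done

# Weights 1, M, M**2, ...: each row of x is read as a base-M number
s = M**np.arange(x.shape[1])
# mode='valid' keeps only full-length windows of y; the default 'full'
# mode would also tag zero-padded partial windows at both ends
yr = np.convolve(y, s[::-1], mode='valid')
xr = x.dot(s)

# Final step : Match and get count
N = np.maximum(xr.max(),yr.max())+1 # or use s[-1]*M if M is small enough
l = np.zeros(N, dtype=bool)  # lookup table over all possible tags
l[yr] = True
count = l[xr].sum()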
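
To see why the two sets of tags line up, here is a small worked example. It is only an illustrative sketch: the helper names array_tags and window_tags and the tiny inputs are made up for this example and are not part of the solution above.

```python
import numpy as np

def array_tags(x, M):
    """Tag each row of the 2D integer array x as a base-M number."""
    s = M ** np.arange(x.shape[1])          # weights 1, M, M**2, ...
    return x.dot(s)

def window_tags(y, M, L):
    """Tag every full length-L window of y with the same base-M weights."""
    s = M ** np.arange(L)
    return np.convolve(y, s[::-1], mode="valid")

M, L = 3, 2
y = np.array([0, 1, 2, 1, 0])
x = np.array([[1, 2], [2, 2]])   # [1, 2] occurs in y, [2, 2] does not

yr = window_tags(y, M, L)        # windows [0,1], [1,2], [2,1], [1,0] -> tags 3, 7, 5, 1
xr = array_tags(x, M)            # rows [1,2], [2,2] -> tags 7, 8
print(np.isin(xr, yr))           # True for [1, 2], False for [2, 2]
```

Each window of y and each row of x goes through the same weighting, so two equal sequences always receive the same tag, and (for values in [0, M)) two different sequences never do.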

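Alternatives to perform the final step

Alternative #1:

Alternative #2:

Alternative #3: For larger M, we can use an empty array instead, so that only the entries that are read later need to be initialized: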

l = np.empty(N, dtype=bool)
l[xr] = False
l[yr] = True
count = l[xr].sum()
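
When M**L is too large for a lookup table, the matching step can also be done without one. The sketch below is an assumption for illustration, not code from the answer: the function names are hypothetical, and it relies only on the standard numpy calls np.isin, np.unique and np.searchsorted.

```python
import numpy as np

def count_with_isin(xr, yr):
    """Count the entries of xr that occur anywhere in yr."""
    return int(np.isin(xr, yr).sum())

def count_with_searchsorted(xr, yr):
    """Same count, via binary search into the sorted unique tags of yr."""
    ys = np.unique(yr)                 # sorted, duplicates removed
    idx = np.searchsorted(ys, xr)      # insertion position of each tag
    idx[idx == len(ys)] = 0            # tags above the maximum cannot match
    return int((ys[idx] == xr).sum())

xr = np.array([3, 9, 5, 5])
yr = np.array([3, 7, 5, 1])
print(count_with_isin(xr, yr), count_with_searchsorted(xr, yr))
```

Both variants need memory proportional to the number of tags rather than to M**L, at the cost of a sort or a hash-based membership test.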

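Further digging (leveraging numba on the convolution)

Profiling the main proposed solution shows that the 1D convolution is the time-consuming part. Looking closer, the 1D convolution uses a specific kernel that is geometric in nature, so it can be computed in O(n) by reusing the boundary elements at each iteration. Note that this is essentially a reversed kernel compared with the one proposed earlier. Putting all these changes together, we end up with something like this: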

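from numba import njit

@njit
def numba1(y, conv_out, M, L, N):
    # Rolling update: shift the previous tag by one base-M digit, add the
    # element entering the window and drop the one leaving it
    A = M**L
    for i in range(1,N):
        conv_out[i] = conv_out[i-1]*M + y[i+L-1] - y[i-1]*A
    return conv_out

def numba_convolve(y, M, L):
    N = len(y)-L+1  # number of full-length windows in y
    conv_out = np.empty(N, dtype=int)
    # The first window is tagged directly; the rest follow from the recurrence
    conv_out[0] = y[:L].dot(M**np.arange(L-1,-1,-1))
    return numba1(y, conv_out, M, L, N)

def intersection_count(xs, y):
    # Note: M (the alphabet size) is read from the global scope
    x = np.asarray(xs) # convert to array, if not already done

    L = x.shape[1]
    s = M**np.arange(L-1,-1,-1)  # reversed weights: M**(L-1), ..., M, 1
    xr = x.dot(s)

    yr_numba = numba_convolve(y, M=M, L=L)

    # Final step : Match and get count
    N = s[0]*M  # M**L, one more than the largest possible tag
    l = np.zeros(N, dtype=bool)
    l[yr_numba] = True
    count = l[xr].sum()
    return count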
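
The recurrence inside numba1 can be checked without numba. The following is a plain-Python sketch for verification only; rolling_tags and direct_tags are hypothetical names, and the loop is far slower than the compiled version.

```python
import numpy as np

def direct_tags(y, M, L):
    """Tag each length-L window directly with weights M**(L-1), ..., M, 1."""
    w = M ** np.arange(L - 1, -1, -1, dtype=np.int64)
    return np.array([y[i:i + L].dot(w) for i in range(len(y) - L + 1)])

def rolling_tags(y, M, L):
    """Same tags, each derived from the previous one in O(1)."""
    y = np.asarray(y, dtype=np.int64)
    n = len(y) - L + 1
    out = np.empty(n, dtype=np.int64)
    out[0] = y[:L].dot(M ** np.arange(L - 1, -1, -1, dtype=np.int64))
    A = M ** L
    for i in range(1, n):
        # multiply by M to shift, add the new last element,
        # subtract the old first element (now weighted by M**L)
        out[i] = out[i - 1] * M + y[i + L - 1] - y[i - 1] * A
    return out

y = np.array([0, 1, 2, 1, 0])
print(rolling_tags(y, 3, 2))   # tags of [0,1], [1,2], [2,1], [1,0]
print(direct_tags(y, 3, 2))    # should agree element by element
```

Because each step touches only the element entering and the element leaving the window, the total cost is linear in len(y) regardless of L.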

Benchmarking

We will reuse the setup from the question:
In [42]: %%timeit
    ...: y_bytes = y.tobytes()
    ...: p = sum(1 for x in xs if x.tobytes() in y_bytes)
927 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]: %timeit intersection_count(xs, y)
7.55 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As mentioned earlier, the conversion to an array could be the bottleneck, so let's time that part as well:
In [44]: %timeit np.asarray(xs)
3.41 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

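So the array conversion accounts for roughly 45% of the total runtime (3.41 ms out of 7.55 ms), which is a significant share. At this point the advice to use a 2D array rather than a list of 1D arrays becomes crucial. The benefit is that array data lets us use vectorized operations, which improves overall performance. To emphasize the value of having a 2D array, here are the speedups with and without the conversion step: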
In [45]: 927/7.55
Out[45]: 122.78145695364239

In [46]: 927/(7.55-3.41)
Out[46]: 223.91304347826087
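
If the patterns can be produced as a 2D array from the start, the conversion cost disappears. A minimal sketch, assuming the patterns are randomly generated as in the question (the seed and the use of np.random.default_rng are choices made for this example):

```python
import numpy as np

N, M, L = 10_000, 3, 7
rng = np.random.default_rng(0)

# One (N, L) array instead of a list of N separate length-L arrays
x = rng.integers(0, M, size=(N, L))

# Tags can then be computed directly, with no np.asarray(xs) step
s = M ** np.arange(L - 1, -1, -1)
xr = x.dot(s)
print(x.shape, xr.shape)
```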

Are all the arrays in xs of the same length?

@Divakar Sure, it is fine to assume that all arrays in xs have the same length.