Vectorized running bin index calculation with TensorFlow or NumPy


I have an integer array like this:

in=[1,2,6,1,3,2,1]

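I want to compute a running index over equal values in the array, i.e. for each element, how many times the same value has already occurred before it. For the array above the output would be: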

out=[0,0,0,1,0,1,2]

The naive implementation would keep one counter per value. I would like a vectorized solution that runs in TensorFlow, or possibly in NumPy.
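For reference, the naive per-value counter mentioned above could look like the sketch below. The function name `running_index_loop` is made up for illustration. The sketch is a plain Python loop, so it is a correctness baseline rather than the vectorized solution being asked for.

```python
from collections import defaultdict

def running_index_loop(values):
    """Return, for each element, how often its value occurred earlier."""
    counts = defaultdict(int)    # value -> number of occurrences seen so far
    out = []
    for v in values:
        out.append(counts[v])    # running index within the group of equal values
        counts[v] += 1
    return out

print(running_index_loop([1, 2, 6, 1, 3, 2, 1]))  # [0, 0, 0, 1, 0, 1, 2]
```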

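I have already considered creating a 2D tensor of shape=(in.shape[0], tf.max(in)), writing a 1 into each cell tensor[i, in[i]], then calling cumsum column-wise and writing the result back row-wise. But my input array is quite large (a few 100k entries) with maximum values around 500k, so this sparse matrix would not even fit into memory.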
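To make the idea and its memory problem concrete, here is a small NumPy sketch of the dense one-hot plus cumsum approach. The name `running_index_dense` is hypothetical, and the sketch uses `max + 1` columns so that the largest value has a column of its own. It works for the toy example but is not practical at the sizes described.

```python
import numpy as np

def running_index_dense(a):
    a = np.asarray(a)
    n, m = a.size, a.max() + 1
    rows = np.arange(n)
    onehot = np.zeros((n, m), dtype=np.int64)
    onehot[rows, a] = 1                 # a single 1 per row, in column a[i]
    counts = onehot.cumsum(axis=0)      # occurrences so far, per column
    return counts[rows, a] - 1          # read back the entry in each row's own column

print(running_index_dense([1, 2, 6, 1, 3, 2, 1]))

# Rough size of the dense matrix for the real data, assuming 100k rows,
# 500k columns and 4-byte integers: about 200 GB.
print(100_000 * 500_000 * 4 / 1e9, "GB")
```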

Do you have any better suggestions? Thanks, everyone!

Here is a solution using pandas:

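import pandas as pd

s = pd.Series([1, 2, 6, 1, 3, 2, 1])
# group equal values and number the members of each group in order of appearance
s.groupby(s).cumcount().values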

Output:

array([0, 0, 0, 1, 0, 1, 2], dtype=int64)

Timing on data of similar size:

import numpy as np

s = pd.Series(np.random.randint(0,500000, 100000))
%timeit -n 100 s.groupby(s).cumcount().values
# 23.9 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You can use an actual sparse matrix, i.e. one with sparse storage. With that, an input like

a = np.random.randint(0, 5*10**5, 10**6)

is no problem:

import numpy as np
from scipy import sparse

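def running(a):
    n,m = a.size,a.max()+1
    # one-hot rows in CSR form (row i has a single 1 in column a[i]),
    # converted to CSC so that the entries are grouped by value
    aux = sparse.csr_matrix((np.ones_like(a),a,np.arange(n+1)),(n,m)).tocsc()
    # keep only the start offsets of non-empty columns (values that occur)
    msk = aux.indptr[1:] != aux.indptr[:-1]
    indptr = aux.indptr[:-1][msk]
    # turn the ones into increments whose cumsum restarts at 0 in each group
    aux.data[0] = 0
    aux.data[indptr[1:]] -= np.diff(indptr)
    out = np.empty_like(a)
    # aux.indices holds the original positions; scatter the counts back
    out[aux.indices] = aux.data.cumsum()
    return out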

# alternative method for validation
def use_argsort(a):
    indices = a.argsort(kind="stable")
    ao = a[indices]
    indptr = np.concatenate([[0],(ao[1:] != ao[:-1]).nonzero()[0]+1])
    data = np.ones_like(a)
    data[0] = 0
    data[indptr[1:]] -= np.diff(indptr)
    out = np.empty_like(a)
    out[indices] = data.cumsum()
    return out

in_ = np.array([1, 2, 6, 1, 3, 2, 1])
print("OP example",in_,"->",running(in_))
print("second opinion","->",use_argsort(in_))

from timeit import timeit
A = np.random.randint(0,500_000,1_000_000)
print("large example (500k labels, 1M entries) takes",
      timeit(lambda:running(A),number=10)*100,"ms")
print("using other method takes",
      timeit(lambda:use_argsort(A),number=10)*100,"ms")
print("same result:",(use_argsort(A) == running(A)).all())

Sample run:

OP example [1 2 6 1 3 2 1] -> [0 0 0 1 0 1 2]
second opinion -> [0 0 0 1 0 1 2]
large example (500k labels, 1M entries) takes 84.1427305014804 ms
using other method takes 262.38483290653676 ms
same result: True

I tried doing this with TensorFlow, and it got ugly quickly. So if you only have around 100k entries, you are probably better off using this solution.
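If the computation does have to stay inside TensorFlow, one possible route is to mirror the stable-sort idea with standard ops. The sketch below is not taken from any of the answers. It assumes TensorFlow 2.x in eager mode and a non-empty 1-D integer tensor, and the function name `running_index_tf` is made up for illustration.

```python
import tensorflow as tf

def running_index_tf(a):
    # Assumes a non-empty 1-D integer tensor.
    n = tf.size(a)
    order = tf.argsort(a, stable=True)          # positions grouped by value, original order kept
    sorted_a = tf.gather(a, order)
    # True wherever a new group of equal values starts in the sorted array
    is_start = tf.concat(
        [tf.constant([True]), tf.not_equal(sorted_a[1:], sorted_a[:-1])], axis=0)
    positions = tf.range(n)
    starts = tf.boolean_mask(positions, is_start)          # sorted position of each group start
    group_id = tf.cumsum(tf.cast(is_start, tf.int32)) - 1  # group number of every sorted element
    rank = positions - tf.gather(starts, group_id)         # offset inside its own group
    # scatter the ranks back to the original positions
    return tf.scatter_nd(order[:, None], rank, tf.reshape(n, [1]))

print(running_index_tf(tf.constant([1, 2, 6, 1, 3, 2, 1])).numpy())
```

Only sorting, gathering and a cumulative sum are involved, so memory use stays linear in the input length rather than growing with the largest value.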