Arrays: vectorizing a comparison

arrays performance numpy

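I wrote a function that draws elements from a non-uniform distribution and returns indices of the input array's elements as if they had been drawn from a uniform distribution. Here is the code and an example: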

import numpy as np

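def uniform_choice(x, n):
    # Count how often each distinct value occurs in x.
    unique, counts = np.unique(x, return_counts=True)
    # element_freq[i] holds the number of occurrences of x[i] in x.
    element_freq = np.zeros(x.shape)
    for i in range(len(unique)):
        element_freq[np.where(x == unique[i])[0]] = counts[i]
    # Weight each element by the inverse of its frequency, then normalize.
    p = 1/element_freq/(1/element_freq).sum()
    # Draw n elements without replacement using those weights.
    return np.random.choice(x, n, False, p)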

x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))
# so this gives a non-uniform distribution of the elements "a", "b", "c", "d"
np.unique(x, return_counts=True)

#returns
(array(['a', 'b', 'c', 'd'], dtype='<U1'), 
array([10082, 19888, 30231, 39799]))

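Is it possible to avoid the for loop in the function? I need to run this sampling many times on very large arrays, so it gets slow. I believe a vectorized version of the comparison would give me faster results.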

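Maybe something along these lines (untested): generate n values from the unique set, then use searchsorted to look those values up in the original array and return their indices.

One difference I would expect with this approach is that you only get the first index at which each value occurs in the original array. That is, a value that occurs multiple times will always be represented by the index of a single occurrence, whereas in your original code it could be represented by several.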
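The code for this suggestion is not shown in the thread, so the following is only a minimal sketch of how the searchsorted idea could look. The function name `uniform_choice_first_index` and the `rng` parameter are assumptions made for illustration. Note that searchsorted with a `sorter` returns positions in the sorted order, so they are mapped back through the sorter to get indices into the original array.

```python
import numpy as np

def uniform_choice_first_index(x, n, rng=None):
    """Hypothetical sketch: pick n labels uniformly, return one index per pick."""
    rng = np.random.default_rng() if rng is None else rng
    unique = np.unique(x)
    # Draw n labels uniformly (with replacement) from the distinct values.
    values = rng.choice(unique, n)
    # A stable argsort keeps equal values in their original order, so the
    # leftmost match in sorted order is the first occurrence in x.
    order = np.argsort(x, kind="stable")
    pos = np.searchsorted(x, values, side="left", sorter=order)
    # Map positions in the sorted order back to indices into x.
    return order[pos]
```

As noted above, every draw of a given label returns the same index (its first occurrence), so this only fits cases where any representative of a label will do.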

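You can eliminate the loop, which I think is the most time-consuming part, by extending the np.unique call to include return_inverse=True. That gives a unique numeric label for each unique string in the input array. Those numeric labels can then be used as indices, which leads to a vectorized computation of element_freq. So the loop part -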

unique, counts = np.unique(x, return_counts=True)
element_freq = np.zeros(x.shape)
for i in range(len(unique)):
   element_freq[np.where(x == unique[i])[0]] = counts[i]

would be replaced by -

unique, idx, counts = np.unique(x, return_inverse=True, return_counts=True)
element_freq = counts[idx]
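Put together with the rest of the question's function, the replacement might look like the sketch below. The names `element_frequencies` and `uniform_choice_vectorized`, and the use of `np.random.default_rng`, are assumptions for illustration; the input is assumed to be a 1-D array.

```python
import numpy as np

def element_frequencies(x):
    """For each element of 1-D x, return how often its value occurs in x."""
    # inverse[i] is the position of x[i] within `unique`, so indexing
    # counts with it broadcasts each count back onto the elements.
    unique, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    return counts[inverse]

def uniform_choice_vectorized(x, n, rng=None):
    """Hypothetical loop-free variant of the question's uniform_choice."""
    rng = np.random.default_rng() if rng is None else rng
    # Inverse-frequency weights, normalized to sum to 1.
    weights = 1.0 / element_frequencies(x)
    p = weights / weights.sum()
    # Draw n elements without replacement, as in the original function.
    return rng.choice(x, n, replace=False, p=p)
```

Because each element is weighted by the inverse of its value's count, every distinct value carries the same total probability on the first draw, which is the property the original loop was computing.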

Runtime test -

In [18]: x = np.random.choice(["a", "b", "c", "d"], 100000, p=(0.1, 0.2, 0.3, 0.4))

In [19]: %%timeit 
    ...: unique, counts = np.unique(x, return_counts=True)
    ...: element_freq = np.zeros(x.shape)
    ...: for i in range(len(unique)):
    ...:    element_freq[np.where(x == unique[i])[0]] = counts[i]
    ...: 
100 loops, best of 3: 18.9 ms per loop

In [20]: %%timeit 
    ...: unique, idx, counts =np.unique(x,return_inverse=True, return_counts=True)
    ...: element_freq = counts[idx]
    ...: 
100 loops, best of 3: 12.9 ms per loop

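That distinction matters to me. This is part of a bigger problem, where the key is to choose indices as if the elements were uniformly distributed. For example, suppose you have a lot of articles, of which 50% are about sports, 25% about politics, and 25% about entertainment. All of the articles are unique, and I want to draw 10000 unique articles so that the sample is 33.33% politics, 33.33% sports, and 33.33% entertainment.

@enedene: Given the situation you outline, I would do something simple: first pick a topic at random, then pick a random article on that topic.

I had never looked at return_inverse, thanks for pointing it out. This is much faster. Thank you.

@enedene Glad to have the performance improvement confirmed!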
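The "pick a topic, then pick an article" suggestion from the comments could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: the name `two_stage_choice` and the `rng` parameter are hypothetical, and the sketch samples with replacement, so it does not by itself guarantee unique articles.

```python
import numpy as np

def two_stage_choice(x, n, rng=None):
    """Hypothetical sketch: uniform over labels, then uniform within a label."""
    rng = np.random.default_rng() if rng is None else rng
    unique, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    # Indices of x grouped by label: group k occupies
    # order[starts[k] : starts[k] + counts[k]].
    order = np.argsort(inverse, kind="stable")
    starts = np.cumsum(counts) - counts
    # Stage 1: pick a label uniformly for each of the n draws.
    labels = rng.integers(0, len(unique), size=n)
    # Stage 2: pick a uniform offset inside the chosen label's group.
    offsets = rng.integers(0, counts[labels])
    return order[starts[labels] + offsets]
```

The returned values are indices into x, so `x[two_stage_choice(x, n)]` gives the sampled labels. Duplicate indices are possible; dropping or redrawing them would be one way to approach the unique-articles requirement.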