Fast removal of duplicates in numpy and python

python arrays performance optimization numpy

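Is there a fast way to get the unique elements in numpy? I have code similar to the following (the last line is the relevant one)

This is just an example; in my situation, indices1, indices2, …, indices4 contain different sets of indices and have various sizes. The last line is executed many times, and I noticed that it is actually the bottleneck in my code ({numpy.core.multiarray.arange}, to be precise). In addition, the ordering is not important, and the elements of the index arrays are of type int32. I was thinking about using a hash table with the element value as the key, and tried: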

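import itertools
import numpy
seq = itertools.chain(tab[indices1].flatten(), tab[indices2].flatten(), tab[indices3].flatten(), tab[indices4].flatten())
myset = {}
# Python 2 idiom: map() pads the shorter argument with None, so this runs myset[x] = None for every x
map(myset.__setitem__, seq, [])
result = numpy.array(myset.keys())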

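but it was even worse.

Is there any way to speed this up? I guess the performance penalty comes from the "fancy indexing" that copies the array, but I only need to read the resulting elements (I do not modify anything).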

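[The following is actually partially incorrect (see the PS):]

The following way of obtaining the unique elements of all the sub-arrays is very fast: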

seq = itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat)
result = set(seq)

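Note that flat (which returns an iterator) is used instead of flatten() (which returns a full array), and that set() can be called directly (instead of using map() and a dictionary, as in your second method).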
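As a small illustration of the flat versus flatten() distinction, here is a minimal sketch on a toy array. The names `a`, `it` and `copied` are hypothetical and do not come from the question.

```python
import numpy as np

# Hypothetical toy array, used only for illustration.
a = np.arange(6).reshape(2, 3)

# .flat is an iterator over the existing data; no new array is allocated.
it = a.flat

# .flatten() allocates and returns a new 1-D array holding a copy of the data.
copied = a.flatten()

# Writing to the copy leaves the original array untouched.
copied[0] = 99

# Iterating over .flat still yields the original values.
values = [int(x) for x in a.flat]
```

Whether avoiding the copy matters in practice depends on the array sizes involved, so it is worth timing on real data.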

Here are the timing results (obtained in the IPython shell):

Thus, the set/flat method is 40 times faster in this example.

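PS: The timing of set(seq) is actually not representative. In fact, the first loop of the timing empties the seq iterator, and the subsequent set() evaluations return an empty set! The correct timing test is the following: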

>>> %timeit set(itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat))
100 loops, best of 3: 9.12 ms per loop

which shows that the set/flat method is actually not fast.
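The pitfall can be reproduced in a few lines. This is a minimal sketch with hypothetical toy data (`tab` and `idx` are made up here): an itertools.chain object can be consumed only once, so any second pass over it sees nothing.

```python
import itertools
import numpy as np

# Hypothetical toy data, used only for illustration.
tab = np.arange(20)
idx = np.array([1, 2, 2, 3])

# Build the iterator once, as in the flawed timing.
seq = itertools.chain(tab[idx].flat, tab[idx].flat)

first = set(seq)   # consumes the iterator completely
second = set(seq)  # the iterator is exhausted, so this set is empty

# Rebuilding the iterator inside the timed expression avoids the problem.
rebuilt = set(itertools.chain(tab[idx].flat, tab[idx].flat))
```

Here `first` and `rebuilt` contain the three distinct values while `second` is empty, which is why the timed statement has to recreate the iterator on every run.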

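PPS: here is an (unsuccessful) exploration of mtrw's suggestion; finding the unique indices beforehand might be a good idea, but I could not find a way to implement it that is faster than the above approach: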

>>> %timeit set(indices1).union(indices2).union(indices3).union(indices4)
100 loops, best of 3: 11.9 ms per loop
>>> %timeit set(itertools.chain(indices1.flat, indices2.flat, indices3.flat, indices4.flat))
100 loops, best of 3: 10.8 ms per loop

Thus, finding the set of all distinct indices is itself quite slow.

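PPPS: numpy.unique() is actually 2-3 times faster than set(). This is the key to the speed-up obtained in Bago's answer (unique(concatenate(…))). The reason is probably that letting NumPy handle its arrays by itself is generally faster than interfacing pure Python (set) with NumPy arrays.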

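Conclusion: this answer therefore only documents failed attempts that should not be followed as they are, along with a possibly useful remark about timing code that uses iterators…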

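I'm sorry, I don't completely understand your question, but I'll do my best to help.

First, {numpy.core.multiarray.arange} is numpy.arange, not fancy indexing; unfortunately, fancy indexing does not show up as a separate line item in the profiler. If you are calling np.arange inside the loop, you should see whether you can move it outside.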

In [27]: prun tab[tab]
     2 function calls in 1.551 CPU seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    1.551    1.551    1.551    1.551 <string>:1(<module>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler'    objects}

In [28]: prun numpy.arange(10000000)
     3 function calls in 0.051 CPU seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.047    0.047    0.047    0.047 {numpy.core.multiarray.arange}
    1    0.003    0.003    0.051    0.051 <string>:1(<module>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

(Also, np.concatenate gives a (4*n,) array and np.array gives a (4, n) array, where n is the length of indices[1-4]. The latter only works if indices1-4 all have the same length.)
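The shape difference can be checked directly. This is a minimal sketch with hypothetical small inputs (two index arrays instead of four, and a made-up `tab`):

```python
import numpy as np

# Hypothetical toy data, used only for illustration.
tab = np.arange(100)
indices1 = np.array([1, 2, 3], dtype=np.int32)
indices2 = np.array([4, 5, 6], dtype=np.int32)
indices3 = np.array([7, 8], dtype=np.int32)  # deliberately a different length

# np.array stacks equal-length pieces into a 2-D array, one row per piece.
stacked = np.array([tab[indices1], tab[indices2]])

# np.concatenate joins the pieces end to end into one 1-D array.
joined = np.concatenate([tab[indices1], tab[indices2]])

# np.concatenate also accepts pieces of different lengths.
# np.array is not called on ragged input here: depending on the NumPy
# version it either raises an error or builds an object array.
joined_ragged = np.concatenate([tab[indices1], tab[indices3]])
```

Since np.unique flattens its input by default, both forms give the same unique values when the lengths match; only np.concatenate extends to index arrays of different sizes.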

Lastly, you could save even more time if you can do the following:

indices = np.unique(np.concatenate((indices1, indices2, indices3, indices4)))
result = tab[indices]

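Doing it in this order is faster because you reduce the number of indices that need to be looked up in tab, but it only works if you know that the elements of tab are unique (otherwise you could get repeats in result even if the indices are unique).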
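A minimal sketch of this caveat, using hypothetical toy arrays (`tab_unique`, `tab_repeats` and the two short index arrays are made up for illustration):

```python
import numpy as np

# Hypothetical toy data, used only for illustration.
tab_unique = np.array([10, 20, 30, 40, 50])   # no repeated values
tab_repeats = np.array([10, 20, 10, 40, 20])  # contains repeated values
indices1 = np.array([0, 2, 2], dtype=np.int32)
indices2 = np.array([2, 4], dtype=np.int32)

# Deduplicate the indices first, then index tab once.
indices = np.unique(np.concatenate((indices1, indices2)))

# When tab has unique elements, the result has no repeats.
result_ok = tab_unique[indices]

# When tab has repeated values, unique indices can still select equal values.
result_dup = tab_repeats[indices]

# One possible remedy for arbitrary tab: a second np.unique on the values.
result_general = np.unique(tab_repeats[indices])
```

With these inputs the unique indices are 0, 2 and 4, so `result_dup` contains the value 10 twice, and the extra np.unique call removes that repeat at the cost of a second sort.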

Hope that helps.

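How fast is it to convert it to a set and then back to a numpy array?

I've checked this method and it is actually about 20% slower.

Could you find the unique indices, and then use them to look up tab?

@mtrw: This sounds like a good idea, but I could not find an implementation that is faster than the first method in my answer.

+1 for concatenate. For this method to work in the general case of an arbitrary input array tab, I would suggest simply doing result = np.unique(tab[np.unique(np.concatenate((indices1, …)))]). This is about twice as fast as the method in the original question.

@EOL, yes, that is an option if the indices contain a lot of repeats, in which case the extra overhead may be worth it.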
Timings of numpy.array versus numpy.concatenate on the same inputs:

In [47]: timeit numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
100 loops, best of 3: 5.11 ms per loop

In [48]: timeit numpy.concatenate([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
1000 loops, best of 3: 544 us per loop
