Counting unique items in a numpy array: why is scipy.stats.itemfreq so slow?

I am trying to count the unique values in a numpy array:

import numpy as np
from collections import defaultdict
import scipy.stats
import time

x = np.tile([1,2,3,4,5,6,7,8,9,10],20000)
for i in [44,22,300,403,777,1009,800]:
    x[i] = 11

def getCounts(x):
    counts = defaultdict(int)
    for item in x:
        counts[item] += 1
    return counts

flist = [getCounts, scipy.stats.itemfreq]

for f in flist:
    print f
    t1 = time.time()
    y = f(x)
    t2 = time.time()
    print y
    print '%.5f sec' % (t2-t1)

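At first I couldn't find a built-in function for this, so I wrote getCounts(); then I found scipy.stats.itemfreq and thought I would use that instead. But it is slow! Here is what I see on my PC. Why is it so much slower than such a simple hand-written function?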

<function getCounts at 0x0000000013C78438>
defaultdict(<type 'int'>, {1: 19998, 2: 20000, 3: 19999, 4: 19999, 5: 19999, 6: 20000, 7: 20000, 8: 19999, 9: 20000, 10: 19999, 11: 7})
0.04700 sec
<function itemfreq at 0x0000000013C5D208>
[[  1.00000000e+00   1.99980000e+04]
 [  2.00000000e+00   2.00000000e+04]
 [  3.00000000e+00   1.99990000e+04]
 [  4.00000000e+00   1.99990000e+04]
 [  5.00000000e+00   1.99990000e+04]
 [  6.00000000e+00   2.00000000e+04]
 [  7.00000000e+00   2.00000000e+04]
 [  8.00000000e+00   1.99990000e+04]
 [  9.00000000e+00   2.00000000e+04]
 [  1.00000000e+01   1.99990000e+04]
 [  1.10000000e+01   7.00000000e+00]]
2.04100 sec


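First, time.time is the wrong function to use for timing, because it measures wall-clock time rather than CPU time. Ideally you would use the timeit module, but time.clock is also a better choice.

Also, you may be using an outdated version of scipy. I am using Python 3.4 and scipy 0.14.0, and these are my timings: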
x = np.tile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 20000)
for i in [44, 22, 300, 403, 777, 1009, 800]:
    x[i] = 11

%timeit getCounts(x)
# 10 loops, best of 3: 55.6 ms per loop

%timeit scipy.stats.itemfreq(x)
# 10 loops, best of 3: 20.8 ms per loop

%timeit collections.Counter(x)
# 10 loops, best of 3: 39.9 ms per loop

%timeit np.unique(x, return_counts=True)
# 100 loops, best of 3: 4.13 ms per loop
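
The %timeit lines above are IPython magics, so they will not run in a plain script. As a rough sketch of the same measurement using only the timeit module (the helper name best_time and the repeat/number values are illustrative assumptions, not part of the original answer), something like the following should work on Python 3. Note that time.clock was removed in Python 3.8, so timeit or time.perf_counter is the safer choice there.

```python
import timeit
from collections import Counter

import numpy as np

x = np.tile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 20000)


def best_time(func, repeat=3, number=5):
    """Return the best per-call time in seconds over `repeat` runs.

    `best_time` is a hypothetical helper written for this sketch.
    """
    timings = timeit.repeat(lambda: func(x), repeat=repeat, number=number)
    return min(timings) / number


results = {
    "Counter": best_time(Counter),
    "np.unique": best_time(lambda a: np.unique(a, return_counts=True)),
}

for name, seconds in results.items():
    print("%-10s %.3f ms per call" % (name, seconds * 1e3))
```

Taking the minimum over several repeats mirrors the "best of 3" figure that %timeit reports; the absolute numbers will differ from machine to machine.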

If you can use numpy 1.9, you can call numpy.unique with the argument return_counts=True, i.e.
unique_items, counts = np.unique(x, return_counts=True)
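
To make the return values concrete, here is a small sketch on a six-element array (the array a and the dictionary freq are illustrative names, not from the original answer). It requires numpy 1.9 or later, as noted above.

```python
import numpy as np

a = np.array([3, 1, 2, 3, 3, 1])

# unique_items holds the sorted distinct values,
# counts holds how often each of them occurs in a.
unique_items, counts = np.unique(a, return_counts=True)

# Pair them up if a dict like the one from getCounts is more convenient.
freq = dict(zip(unique_items.tolist(), counts.tolist()))
print(freq)  # {1: 2, 2: 1, 3: 3}
```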

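In fact, itemfreq was updated to use np.unique, but scipy currently supports numpy versions back to 1.5, so it does not use the return_counts argument.

Here is the complete implementation of itemfreq in scipy 0.14: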
def itemfreq(a):
    items, inv = np.unique(a, return_inverse=True)
    freq = np.bincount(inv)
    return np.array([items, freq]).T
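
To see why those three lines produce a frequency table, it may help to trace them on a tiny input. The sketch below uses an illustrative array a; the intermediate values in the comments are what the calls return for this input.

```python
import numpy as np

a = np.array([30, 10, 20, 30, 30, 10])

# items: sorted distinct values; inv: for each element of a,
# the index of its value within items.
items, inv = np.unique(a, return_inverse=True)
# items -> [10 20 30]
# inv   -> [2 0 1 2 2 0]

# The inverse indices reconstruct the original array.
assert np.array_equal(items[inv], a)

# Counting how often each index occurs gives the frequency of each item.
freq = np.bincount(inv)
# freq -> [2 1 3]

# Stack values and counts as two columns, one row per distinct value.
table = np.array([items, freq]).T
print(table)
```

Because inv contains only small non-negative integers (positions in items), np.bincount can count them in a single pass regardless of what the original values were.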

Thanks for the reply. I can't use numpy 1.9 or scipy 0.14 yet because of some module conflicts in my application, but the new scipy.stats.itemfreq looks much faster:
import numpy as np
from collections import defaultdict, Counter
import scipy.stats
import time
import timeit

x = np.tile([1,2,3,4,5,6,7,8,9,10],20000)
for i in [44,22,300,403,777,1009,800]:
    x[i] = 11

def getCounts(x):
    counts = defaultdict(int)
    for item in x:
        counts[item] += 1
    return counts

def itemfreq_scipy14(x):
    '''this is how itemfreq works in 0.14:
    https://github.com/scipy/scipy/commit/7e04d6630f229693cca3522b62aa16226f174053
    '''
    items, inv = np.unique(x, return_inverse=True)
    freq = np.bincount(inv)
    return np.array([items, freq]).T

flist = [getCounts, scipy.stats.itemfreq, np.bincount, itemfreq_scipy14, Counter]


for f in flist:
    print f
    print timeit.timeit(lambda: f(x),number=3)

This yields on my PC:
<function getCounts at 0x0000000013F8EB38>
0.148138969181
<function itemfreq at 0x0000000013C5D208>
6.15385023664
<built-in function bincount>
0.00313706656675
<function itemfreq_scipy14 at 0x0000000013F8EDD8>
0.0757223407165
<class 'collections.Counter'>
0.255281199559


For the lazy:
import pandas as pd
pd.Series( my_list_or_array ).nunique()
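
Note that nunique() returns only the number of distinct values, not how often each one occurs. If the per-value counts asked about in the question are needed, Series.value_counts() is the pandas counterpart. A minimal sketch, assuming pandas is installed (the array a and the names n_unique, counts, and freq are illustrative):

```python
import numpy as np
import pandas as pd

a = np.array([3, 1, 2, 3, 3, 1])
s = pd.Series(a)

# Number of distinct values only.
n_unique = s.nunique()

# Occurrences of each distinct value, sorted by value rather than by count.
counts = s.value_counts().sort_index()

freq = {int(value): int(count) for value, count in counts.items()}
print(n_unique, freq)  # 3 {1: 2, 2: 1, 3: 3}
```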

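There is an issue ticket for this function. It looks like the maintainers had similar concerns about the way it was written. If you want to know why it is so slow, it may be worth a look. Depending on whether your scipy version is from before or after that commit, you should be able to see how it is implemented.

I never realized how much of a speedup the numpy functions give. It makes you wonder why they were not used in the scipy function.

Actually, I can use np.bincount in my application, since I have a large array of small integers.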
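
As a sketch of the np.bincount approach mentioned in the last comment: it applies only to non-negative integers, and the result has one slot per integer from 0 up to the maximum value, so it suits arrays of small integers. The array a below is illustrative.

```python
import numpy as np

a = np.array([3, 1, 2, 3, 3, 1, 7])

# counts[v] is the number of times the integer v occurs in a.
# The result has length a.max() + 1, including zeros for absent values.
counts = np.bincount(a)
# counts -> [0 2 1 3 0 0 0 1]

# Keep only the values that actually occur.
values = np.nonzero(counts)[0]
freq = dict(zip(values.tolist(), counts[values].tolist()))
print(freq)  # {1: 2, 2: 1, 3: 3, 7: 1}
```

Since the output length grows with the largest value rather than with the number of distinct values, np.unique is the more general choice when the integers are large or the data is not integral.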