Python: numpy sum is slower than string count

python string numpy

I am comparing a numpy array of characters with the string method count.

genome is a very long string:
g1 = genome
g2 = np.array([i for i in genome])

%timeit np.sum(g2=='C')
4.43 s ± 230 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit g1.count('C')
955 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I expected the numpy array to be faster, but I was wrong.
Can someone explain how the count method works, and what makes it faster than using a numpy array?
Thanks!
Let's explore some variations on the problem. I won't try to make a string as large as yours.
In [393]: astr = 'ABCDEF'*10000

First, the string count:
In [394]: astr.count('C')
Out[394]: 10000
In [395]: timeit astr.count('C')
70.2 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Now try a 1-element array made from that string:
In [396]: arr = np.array(astr)                                                       
In [397]: arr.shape                                                                  
Out[397]: ()
In [398]: np.char.count(arr, 'C')                                                    
Out[398]: array(10000)
In [399]: timeit np.char.count(arr, 'C')                                             
200 µs ± 2.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [400]: arr.dtype                                                                  
Out[400]: dtype('<U60000')

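List count (on alist, a list of the string's characters) has to loop through the elements and test each one against 'C'. It is still faster than sum(i=='C' for i in alist) (and variants).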
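The list comparison can be reproduced with a short script. This is only a sketch: it assumes alist is list(astr), since the definition of alist is not shown here, and the timings will differ from machine to machine.

```python
import timeit

astr = 'ABCDEF' * 10000
alist = list(astr)  # assumption: alist holds the single characters of astr

# list.count runs its loop in C, comparing each element against 'C'
n_count = alist.count('C')

# the generator version performs the same comparisons one by one in Python bytecode
n_sum = sum(i == 'C' for i in alist)

t_count = timeit.timeit(lambda: alist.count('C'), number=100)
t_sum = timeit.timeit(lambda: sum(i == 'C' for i in alist), number=100)

print(n_count, n_sum)
print(f"list.count: {t_count:.4f} s, generator sum: {t_sum:.4f} s (100 runs each)")
```

Both approaches return the same count; only the time spent per element differs.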
Now make an array from that list, with single-character elements:

In [405]: arr1 = np.array(alist)
In [406]: arr1.shape
Out[406]: (60000,)
In [407]: timeit arr1=='C'
634 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [408]: timeit np.sum(arr1=='C')
740 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

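np.sum is relatively fast. The check against 'C' takes most of the time.
If I construct an array of numbers of the same size, the count is much faster. The equality test on numbers is faster than the equivalent test on strings.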

In [431]: arr2 = np.resize(np.array([1,2,3,4,5,6]),arr1.shape[0])
In [432]: np.sum(arr2==3)
Out[432]: 10000
In [433]: timeit np.sum(arr2==3)
155 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
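One way to get a numeric comparison for the actual string is to view its bytes as small integers. The sketch below is not from the answer; it assumes the genome contains only ASCII characters, and it uses a short stand-in string instead of the real genome.

```python
import numpy as np

genome = 'ABCDEF' * 10000  # stand-in for the real genome string (assumed ASCII only)

# View the encoded bytes as unsigned 8-bit integers; no per-character str objects are created
codes = np.frombuffer(genome.encode('ascii'), dtype=np.uint8)

# Compare against the integer code of 'C' and count the matches
n_c = int(np.count_nonzero(codes == ord('C')))

print(n_c, genome.count('C'))
```

Whether this beats str.count on a given machine has to be measured with timeit; it still builds a temporary boolean array, which the string method does not.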

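numpy is not guaranteed to be faster for every Python operation. For the most part, when working with string elements, it relies heavily on Python's own string code.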
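How does it work? Have you checked the source code?
I think your test is unfair, because g2=='C' creates a new array containing [True, False, ...], whereas g1.count('C') only needs to keep track of a single number.
np.sum(g2=='C') has to iterate over the array twice and create a new array: once to build the index mask where g2=='C', and once more to sum it. g1.count('C') only needs to iterate once. A fairer test would use sum(i=='C' for i in genome) instead of g1.count.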
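The two passes described in these comments can be made visible by splitting the expression into its steps. This is a small illustration with a stand-in string, not a benchmark; the variable names are chosen here for the example.

```python
import numpy as np

astr = 'ABCDEF' * 10000          # stand-in for the genome
arr1 = np.array(list(astr))      # single-character elements

mask = arr1 == 'C'               # first pass: allocates a boolean array as long as arr1
total = int(mask.sum())          # second pass: adds up the True values

print(arr1.dtype, arr1.itemsize)             # each one-character element takes several bytes
print(mask.dtype, mask.shape, mask.nbytes)   # the temporary mask
print(total, astr.count('C'))                # same answer as the single-pass str.count
```

str.count scans the string once and keeps only a running counter, so it never allocates anything comparable to mask.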
Well, numpy is numerical Python; if you are working with strings, Python's standard operations are better optimized and are probably the better choice.