Does pandas iterrows have performance issues?

python performance pandas

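I have noticed very poor performance when using iterrows from pandas.

Is this something others have experienced? Is it specific to iterrows, and should this function be avoided for data of a certain size (I am working with 2-3 million rows)?

A discussion on GitHub led me to believe that this is caused by mixing dtypes in the DataFrame, but the simple example below shows that it happens even with a single dtype (float64). This takes 36 seconds on my machine: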

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i=0
for rowindex, row in dfa.iterrows():
    i+=1
end = time.time()
print(end - start)

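Why are vectorized operations like apply so much quicker? I imagine there must be some row-by-row iteration going on there too.

I cannot figure out how to avoid iterrows in my case (I will save that for a future question). So I would appreciate hearing from you if you have consistently been able to avoid this iteration. I am making calculations based on data in separate DataFrames. Thank you!

---Edit: a simplified version of what I want to run has been added below---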

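In general, iterrows should only be used in very specific cases. This is the general order of precedence for the performance of various operations: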

1) vectorization
2) using a custom cython routine
3) apply
    a) reductions that can be performed in cython
    b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)

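Using a custom Cython routine is usually too complicated, so let's skip that for now.

1) Vectorization is always the first and best choice. However, there is a small set of cases (usually involving a recurrence) that cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, other methods may be faster.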

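3) apply can usually be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will execute pretty swiftly, though of course df.sum(1) is even better. However, something like df.apply(lambda x: x['b'] + 1) will be executed in Python space and is consequently much slower.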
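
To make the apply comparison concrete, here is a minimal sketch. The frame and its column names (`a`, `b`) are made up for illustration, and `axis=1` is passed explicitly so that both forms operate row-wise; how much faster the built-in forms are will depend on your pandas version and data size.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration; the column names are assumptions.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Row-wise sum through apply: each row is handed to the lambda.
row_sums_apply = df.apply(lambda x: np.sum(x), axis=1)

# The same reduction as a built-in DataFrame method.
row_sums_builtin = df.sum(axis=1)

# A lambda that indexes into each row is evaluated row by row...
b_plus_one_apply = df.apply(lambda x: x["b"] + 1, axis=1)

# ...whereas the equivalent column arithmetic is one vectorized operation.
b_plus_one_vectorized = df["b"] + 1

print(row_sums_apply.equals(row_sums_builtin))
print(b_plus_one_apply.equals(b_plus_one_vectorized))
```

Both pairs produce the same values, so the choice between them is purely about speed and readability.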

4) itertuples does not box the data into a Series. It just returns the data in the form of tuples.

5) iterrows does box the data into a Series. Unless you really need this, use another method.
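
The difference between points 4 and 5 can be seen by inspecting what each iterator yields. This is a small sketch on a made-up two-row frame (the column names `s1` and `s2` are borrowed from the question).

```python
import pandas as pd

# Tiny illustrative frame; the values are made up.
df = pd.DataFrame({"s1": [0.1, 0.2], "s2": [1.0, 2.0]})

# iterrows yields (index, Series): a new Series is built for every row.
index, row = next(df.iterrows())
print(type(row).__name__)

# itertuples yields namedtuples; fields are read as attributes.
tup = next(df.itertuples())
print(tup.Index, tup.s1, tup.s2)

# With index=False and name=None it yields plain tuples.
plain = next(df.itertuples(index=False, name=None))
print(plain)
```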
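
6) Updating an empty frame one row at a time. I have seen this method used far too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating one row at a time will always be very slow. It is much better to create new structures.

Vector operations in Numpy and pandas are much faster than scalar operations in vanilla Python for several reasons: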

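Amortized type lookup: Python is a dynamically typed language, so there is runtime overhead for each element in an array. However, Numpy (and therefore pandas) performs its calculations in C (often via Cython). The type of the array is determined only once, at the start of the iteration; this saving alone is one of the biggest wins.
Better caching: Iterating over a C array is cache-friendly and therefore very fast. A DataFrame is a "column-oriented table", which means that each column is really just an array. So the native operations you can perform on a DataFrame (such as summing all the elements in a column) will have few cache misses.
More opportunities for parallelism: A simple C array can be operated on via SIMD instructions. Some parts of Numpy enable SIMD, depending on your CPU and installation process. The benefits of parallelism will not be as dramatic as those of static typing and better caching, but they are still a solid win.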

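Moral of the story: use the vector operations in Numpy and pandas. They are faster than scalar operations in Python for the simple reason that these operations are exactly what a C programmer would have written by hand anyway. (Except that the array notation is much easier to read than explicit loops with embedded SIMD instructions.)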
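
As a rough illustration of the points above, the sketch below computes the same sum twice: once with a Python-level loop over the elements and once with a single vectorized call. The array size and seed are arbitrary choices; the two totals agree up to floating-point rounding, and timing them yourself (for example with `%timeit`) should show the gap on your machine.

```python
import numpy as np

# Arbitrary sample data; size and seed are assumptions for the example.
rng = np.random.default_rng(0)
values = rng.standard_normal(100_000)

# Scalar loop: every element passes through the Python interpreter.
total_loop = 0.0
for v in values:
    total_loop += v

# Vectorized: the dtype is known up front and the loop runs in compiled code.
total_vectorized = values.sum()

print(values.dtype, np.isclose(total_loop, total_vectorized))
```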

Here is one way to solve your problem. It is fully vectorized.
In [58]: df = table1.merge(table2,on='letter')

In [59]: df['calc'] = df['number1']*df['number2']

In [60]: df
Out[60]: 
  letter  number1  number2  calc
0      a       50      0.2    10
1      a       50      0.5    25
2      b      -10      0.1    -1
3      b      -10      0.4    -4

In [61]: df.groupby('letter')['calc'].max()
Out[61]: 
letter
a         25
b         -1
Name: calc, dtype: float64

In [62]: df.groupby('letter')['calc'].idxmax()
Out[62]: 
letter
a         1
b         2
Name: calc, dtype: int64

In [63]: df.loc[df.groupby('letter')['calc'].idxmax()]
Out[63]: 
  letter  number1  number2  calc
1      a       50      0.5    25
2      b      -10      0.1    -1
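
For readers who want to run the session above as a script, here is a self-contained sketch. The contents of `table1` and `table2` are not shown in the session, so they are reconstructed here from the printed output in Out[60]; treat them as an assumption.

```python
import pandas as pd

# Input tables reconstructed from the printed merge result (an assumption).
table1 = pd.DataFrame({"letter": ["a", "b"], "number1": [50, -10]})
table2 = pd.DataFrame({"letter": ["a", "a", "b", "b"],
                       "number2": [0.2, 0.5, 0.1, 0.4]})

# Join the two tables on the shared key, then compute the product per row.
df = table1.merge(table2, on="letter")
df["calc"] = df["number1"] * df["number2"]

# For each letter, keep the row whose product is largest.
best = df.loc[df.groupby("letter")["calc"].idxmax(), ["letter", "number2"]]
print(best)
```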

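Another option is to use to_records(), which is faster than both itertuples and iterrows.
But in your case there is a lot of room for other kinds of improvement.
Here is my final optimized version: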
def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        # np.multiply is in general faster than "x * y"
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))


Benchmarks:
-- iterrows() --
100 loops, best of 3: 12.7 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

-- itertuple() --
100 loops, best of 3: 12.3 ms per loop

-- to_records() --
100 loops, best of 3: 7.29 ms per loop

-- Use group by --
100 loops, best of 3: 4.07 ms per loop
  letter  number2
1      a      0.5
2      b      0.1
4      c      5.0
5      d      4.0

-- Avoid multiplication --
1000 loops, best of 3: 1.39 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0


Full code:
import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b','c','d'],
      'number1':[50,-10,.5,3]}

t2 = {'letter':['a','a','b','b','c','d','c'],
      'number2':[0.2,0.5,0.1,0.4,5,4,1]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=table1.index)


print('\n-- iterrows() --')

def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow_in_t2 = calculation.index(max(calculation))
    return t2info.loc[maxrow_in_t2]

#%% Iterate through filtering relevant data, optimizing, returning info
def iterthrough():
    for row_index, row in table1.iterrows():   
        t2info = table2[table2.letter == row['letter']].reset_index()
        table3.iloc[row_index,:] = optimize(t2info, row['number1'])

%timeit iterthrough()
print(table3)

print('\n-- itertuple() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.itertuples():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.itertuples():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()


print('\n-- to_records() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.to_records():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.to_records():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()

print('\n-- Use group by --')

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    for index, letter, n1 in table1.to_records():
        t2 = table2.iloc[grouped.groups[letter]]
        calculation = t2.number2 * n1
        maxrow = calculation.argsort().iloc[-1]
        ret.append(t2.iloc[maxrow])
    global table3
    table3 = pd.DataFrame(ret)

%timeit iterthrough()
print(table3)

print('\n-- Even Faster --')
def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

%timeit iterthrough()
print(table3)

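The final version is almost 10x faster than the original code. The strategy is:
Use groupby to avoid comparing values repeatedly.
Use to_records to access the raw numpy.records objects.
Do not operate on the DataFrame until you have compiled all the data.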
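
The last point can be sketched in isolation. The rows below are simply the result values printed in the benchmarks, hard-coded as a stand-in for whatever the loop computes; the point is only that results are gathered in a plain list and the DataFrame is built once at the end.

```python
import pandas as pd

# Stand-in for per-row results; in real code these come from the loop body.
computed = [("a", 0.5), ("b", 0.1), ("c", 5.0), ("d", 4.0)]

rows = []
for letter, number2 in computed:
    # Appending to a list is cheap and involves no index checks.
    rows.append((letter, number2))

# Build the DataFrame in one step, after all rows are known.
table3 = pd.DataFrame(rows, columns=["letter", "number2"])
print(table3)
```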
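
Yes, itertuples() is faster than iterrows().
You can refer to the documentation:
"To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows."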
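
The dtype point in the quoted documentation can be checked with a small mixed-dtype frame. This is a sketch with made-up column names (`qty`, `price`): iterrows has to fit the whole row into one Series, so the integer value comes back as a float, while itertuples returns each field separately.

```python
import pandas as pd

# Hypothetical mixed-dtype frame; the column names are assumptions.
df = pd.DataFrame({"qty": [1, 2], "price": [0.5, 1.5]})

# iterrows: the row Series gets one common dtype, so qty is upcast to float.
_, row = next(df.iterrows())
print(row.dtype, repr(row["qty"]))

# itertuples: each field is taken from its own column, so qty stays an integer.
tup = next(df.itertuples())
print(repr(tup.qty), repr(tup.price))
```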
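
Do not use iterrows!
...or iteritems, or itertuples. Seriously, don't. Wherever possible, vectorize your code. If you don't believe me ...
I will concede that there are legitimate use cases for iterating over a DataFrame, but there are better alternatives for iteration than the iter* family of functions.

Far too often, beginners ask questions about code that involves iterrows. Because these new users may not be familiar with the concept of vectorization, they envision the solution to their problem as something involving loops or other iterative routines. Not knowing how to iterate either, they usually end up with ...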

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})
columns = list(dfa.columns)
dfa = dfa.values
start = time.time()
i=0
for row in dfa:
    blablabla = row[columns.index('s1')]
    i+=1
end = time.time()
print (end - start)
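
One possible refinement of the loop above, offered as a sketch rather than a measured improvement: `columns.index('s1')` is evaluated on every iteration, and the lookup can be hoisted out of the loop. The small frame here is made up so that the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up.
dfa = pd.DataFrame({"s1": np.arange(5.0), "s2": np.arange(5.0) * 2})

# Look up the column position once, before the loop.
s1_pos = dfa.columns.get_loc("s1")

# Iterate over the underlying NumPy array instead of the DataFrame.
values = dfa.to_numpy()
total = 0.0
for row in values:
    total += row[s1_pos]

print(total)
```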