Python Pandas: `try: df.loc[x]` vs `x in df.index`

python pandas

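I have a DataFrame with a single column. I want to write a function that returns the column value for a given key or, if the key is not in the index, a different (constant) value. I can think of (at least) two reasonable ways to do this. Apart from speed, is there any reason to prefer one over the other?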

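As for speed: with len(df) = 10k and len(ids_to_check) = 20k, the try/except version is about 2x slower. That surprised me, because the other method has to look through the index twice. Is there an intuitive explanation for this behavior?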

Using a `try`/`except` block:

def attempt_1(id_val,df):
    try:
        return df.loc[id_val]
    except KeyError:
        return constant_val

%timeit [attempt_1(i,df) for i in ids_to_check]

1 loops, best of 3: 480 ms per loop

Using `in` to test whether `id_val` is in the index:

def attempt_2(id_val,df):
    if id_val in df.index:
        return df.loc[id_val]
    else:
        return constant_val

%timeit [attempt_2(i,df) for i in ids_to_check]

1 loops, best of 3: 235 ms per loop
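For a single key, a third option worth considering is `Series.get`, which takes a default and so needs neither the `try`/`except` nor the explicit membership test. The sketch below is only an illustration: the name `attempt_3`, the three-row frame, and the value of `constant_val` are assumptions, not part of the original question. Note that it returns a scalar from column `A`, whereas `df.loc[id_val]` in the functions above returns the whole row.

```python
import pandas as pd

constant_val = -100.0  # hypothetical default, chosen for illustration

# Tiny stand-in for the real frame; default integer index 0, 1, 2.
df = pd.DataFrame({"A": [0.5, -1.2, 3.3]})

def attempt_3(id_val, df, default=constant_val):
    # Series.get returns `default` instead of raising KeyError
    # when `id_val` is not a label in the index.
    return df["A"].get(id_val, default)

present = attempt_3(1, df)   # label 1 exists
missing = attempt_3(7, df)   # label 7 does not exist
```

This has not been timed against the two functions above, so no speed claim is made for it.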


Create a test frame
In [22]: df = DataFrame(dict(A = np.random.randn(10000)))                            

Pick some ids
In [21]: ids_to_check = np.random.choice(np.arange(0,20000),size=10000,replace=False)

Your methods
In [18]: %timeit [attempt_2(i,df) for i in ids_to_check]
1 loops, best of 3: 409 ms per loop

In [16]: %timeit [attempt_1(i,df) for i in ids_to_check]
1 loops, best of 3: 620 ms per loop

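An efficient way is to use a vectorized lookup. `isin` returns a boolean array marking which index values are among the ids; indexing with that array is quite fast.

I then reindex to get back the original index, and fill the missing entries with a value.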

In [19]: %timeit df.A.loc[df.index.isin(ids_to_check)].reindex(df.index).fillna(-100)
100 loops, best of 3: 6.74 ms per loop
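If what you want is one result per id in `ids_to_check` (the value when the id is in the index, the constant otherwise), a related vectorized sketch is to reindex by the ids themselves with `fill_value`. This is a variation on the approach above, not the same expression, and it assumes the frame's index has no duplicate labels; the small frame, the ids, and `constant_val` here are made up for illustration.

```python
import numpy as np
import pandas as pd

constant_val = -100.0  # hypothetical default, chosen for illustration

# Tiny stand-in for the real frame; default integer index 0..3.
df = pd.DataFrame({"A": [0.5, -1.2, 3.3, 0.7]})
ids_to_check = np.array([2, 9, 0, 5])  # 9 and 5 are not in the index

# One lookup for all ids: labels found in df.index keep their value,
# labels that are absent get `constant_val`.
result = df["A"].reindex(ids_to_check, fill_value=constant_val)
```

The result is indexed by `ids_to_check`, in the order the ids were given, rather than by `df.index`.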

This returns a Series; it could just as easily return a DataFrame.

In [20]: df.A.loc[df.index.isin(np.random.choice(np.arange(0,20000),size=10000,replace=False))].reindex(df.index).fillna(-100)
Out[20]: 
0    -100.000000
1      -0.485421
2      -0.397338
3    -100.000000
4       0.573031
5    -100.000000
6       0.359699
7       0.298462
8    -100.000000
9      -1.274819
10   -100.000000
11      0.112869
12   -100.000000
13     -2.251186
14     -0.846211
...
9985   -100.000000
9986     -0.988055
9987     -0.080460
9988   -100.000000
9989      1.007490
9990     -1.454466
9991      0.875455
9992   -100.000000
9993   -100.000000
9994      0.194506
9995   -100.000000
9996   -100.000000
9997   -100.000000
9998     -0.477828
9999     -0.777487
Name: A, Length: 10000, dtype: float64
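To get a DataFrame rather than a Series from the same expression, one simple option is `Series.to_frame()`. A minimal sketch on a four-row frame (the data and ids are made up for illustration):

```python
import pandas as pd

# Tiny stand-in for the real frame; default integer index 0..3.
df = pd.DataFrame({"A": [0.5, -1.2, 3.3, 0.7]})
ids_to_check = [1, 3, 8]  # 8 is not in the index

# Same pattern as above: keep rows whose label is in the ids,
# restore the full index, then fill the gaps with -100.
filled = df["A"].loc[df.index.isin(ids_to_check)].reindex(df.index).fillna(-100)

# Convert the Series to a one-column DataFrame named "A".
as_frame = filled.to_frame()
```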

So the answer is: always use a vectorized method rather than looping.