Is a Python pandas DataFrame search linear time or constant time?


I have a DataFrame object df with more than 15000 rows, such as:

anime_id          name              genre    rating
1234      Kimi no nawa    Romance, Comedy     9.31
5678       Stiens;Gate             Sci-fi     8.92

I am trying to find the row with a particular anime_id:

a_id = "5678"
temp = (df.query("anime_id == "+a_id).genre)

I just want to know whether this search is done in constant time (like a dictionary) or in linear time (like a list).

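I can't tell you how it is implemented, but after running a small test, it looks like a DataFrame boolean mask behaves more like a linear search: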

>>> timeit.timeit('dict_data[key]',setup=setup,number = 10000)
0.0005770014540757984
>>> timeit.timeit('df[df.val==key]',setup=setup,number = 10000)
17.583375428628642
>>> timeit.timeit('[i == key for i in dict_data ]',setup=setup,number = 10000)
16.613936403242406

This is a very interesting question!

I think it depends on the following aspects:

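Accessing a single row by index (when the index is sorted and unique) should have a runtime of O(m), where m [...]

You should note that even iloc is about 2 orders of magnitude slower than a hashmap when the index is unique: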
df = pd.DataFrame(np.random.randint(0, 10**7, 10**5), columns=['a'])
%timeit df.iloc[random.randint(0,10**5)]
10000 loops, best of 3: 51.5 µs per loop

s = set(np.random.randint(0, 10**7, 10**5))
%timeit random.randint(0,10**7) in s
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 615 ns per loop

What is m in your answer?
In [54]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5))

In [55]: %timeit df.loc[random.randint(0, 10**4)]
100 loops, best of 3: 12.3 ms per loop

In [56]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [57]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.78 ms per loop

In [58]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.93 ms per loop

In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()

In [65]: df.index.is_monotonic_increasing
Out[65]: True

In [66]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 478 µs per loop

In [67]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [68]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.81 ms per loop

In [69]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.95 ms per loop
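Applied to the data in the question, the two access styles compared above might look like the sketch below. It uses only the two sample rows from the question and assumes anime_id is stored as an integer and is unique; neither assumption is confirmed by the original post.

```python
import pandas as pd

# Sample rows copied from the question; the real frame has 15000+ rows.
df = pd.DataFrame(
    {
        "anime_id": [1234, 5678],
        "name": ["Kimi no nawa", "Stiens;Gate"],
        "genre": ["Romance, Comedy", "Sci-fi"],
        "rating": [9.31, 8.92],
    }
)

# Boolean mask: every row's anime_id is compared against the target.
by_mask = df.loc[df["anime_id"] == 5678, "genre"]

# Index lookup: make anime_id the index once, then address rows by label.
indexed = df.set_index("anime_id")
assert indexed.index.is_unique  # assumption: ids do not repeat
by_index = indexed.loc[5678, "genre"]

print(by_mask.iloc[0], by_index)
```

The mask returns a Series (there could be several matches), while the label lookup on a unique index returns a single value. Building the index costs one pass over the data, so it pays off mainly when many lookups are made against the same frame.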
