Python: cannot do slice indexing
I am trying to work with a pandas MultiIndex DataFrame that looks like this:
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
      3001131  3001132     G|A
I would like to be able to do this:
df.loc[('chr1', slice(3000714, 3001110))]
This operation fails with the following error:
cannot do slice indexing with these indexers [1204741]
df.index.levels[1].dtype returns dtype('int64'), so it should work with integer slices, right?
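Also, any comments on how to do this efficiently would be valuable, since the DataFrame has 12 million rows and I need to run roughly 70 million queries of this slicing kind.

I think you need to add ,: at the end. It means you want to slice the rows but keep all columns: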
print (df.loc[('chr1', slice(3000714, 3001110)),:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001065  3001066     G|T
      3001110  3001111     G|C
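Another solution is to add axis=0 to loc:
print (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
But if you need only the rows for 3000714 and 3001110, pass a list instead of a slice: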
print (df.loc[('chr1', [3000714, 3001110]),:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
idx = pd.IndexSlice
print (df.loc[idx['chr1', [3000714, 3001110]],:])
                   end ref|alt
chrom start
chr1  3000714  3000715     T|G
      3001110  3001111     G|C
Timings:
In [21]: %timeit (df.loc[('chr1', slice(3000714, 3001110)),:])
1000 loops, best of 3: 757 µs per loop
In [22]: %timeit (df.loc(axis=0)[('chr1', slice(3000714, 3001110))])
1000 loops, best of 3: 743 µs per loop
In [23]: %timeit (df.loc[('chr1', [3000714, 3001110]),:])
1000 loops, best of 3: 824 µs per loop
In [24]: %timeit (df.loc[pd.IndexSlice['chr1', [3000714, 3001110]],:])
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 826 µs per loop
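Great, that works well. Thanks for the explanation. I also realized that in my case, since my first index level is much smaller than the second (23 items in the level[0] index and 12.6 million items in the level[1] index), I get a big speedup by splitting the DataFrame into a dictionary keyed on the first index level. On my full DataFrame, the df.loc(axis=0)[('chr1',slice(3000714,3001110))] approach takes 218 ms per loop, while building the dictionary and running dfs['chr1'].loc[3000714:3001110] takes only 95.7 µs per loop. Thanks again.

@jezrael, how can I select a DataFrame from one index to another, within that range? I have a function with users.index=np.arange(0,len(users)), and it returns nothing: users.loc[start:end:] gives an empty DataFrame, but the users DataFrame has content.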
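The dictionary split described in the first comment could look roughly like the sketch below. The comment does not show how dfs was built, so the groupby and droplevel construction here is my assumption, and the example uses only the four chr1 rows from the question.

```python
import pandas as pd

# Same four-row example frame as in the question.
index = pd.MultiIndex.from_tuples(
    [("chr1", 3000714), ("chr1", 3001065), ("chr1", 3001110), ("chr1", 3001131)],
    names=["chrom", "start"],
)
df = pd.DataFrame(
    {
        "end": [3000715, 3001066, 3001111, 3001132],
        "ref|alt": ["T|G", "G|T", "G|C", "G|A"],
    },
    index=index,
).sort_index()

# Assumed construction: one sub-frame per chromosome, indexed by 'start' only.
dfs = {
    chrom: sub.droplevel("chrom")
    for chrom, sub in df.groupby(level="chrom")
}

# Each query is now a plain label slice on a sorted integer index.
hits = dfs["chr1"].loc[3000714:3001110]
print(hits)
```

The split is paid for once up front, after which every query avoids the MultiIndex lookup entirely, which is consistent with the speedup reported in the comment.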