Find element's index in pandas Series


python pandas


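I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a certain element of a Series in python pandas? (The first occurrence would suffice.)

That is, I'd like something like: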

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # hypothetical method; should output 3

Certainly, it is possible to define such a method with a loop:

def find(s, el):
    # Walk the index labels and return the first one whose value equals el.
    for i in s.index:
        if s[i] == el:
            return i
    return None

print(find(myseries, 7))

But I assume there should be a better way. Is there?

>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

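Though I admit that there should be a better way to do this, it at least avoids iterating and looping through the object and moves the work to the C level.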
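One thing the snippet above does not cover is the case where the value is absent: `.index[0]` then raises an IndexError. Here is a minimal sketch of a wrapper that returns None instead. The helper name `first_index_of` is made up for illustration and is not part of pandas.

```python
import pandas as pd

def first_index_of(series, value):
    """Return the index label of the first element equal to value, or None."""
    # Vectorized comparison, then pick the matching labels from the index.
    matches = series.index[(series == value).to_numpy()]
    return matches[0] if len(matches) > 0 else None

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(first_index_of(myseries, 7))   # 3
print(first_index_of(myseries, 10))  # None
```

The comparison still scans the whole Series, so this is no faster than the original expression; it only makes the "not found" case explicit.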

Converting to an Index, you can use get_loc:

In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: pd.Index(myseries).get_loc(7)
Out[3]: 3

In [4]: pd.Index(myseries).get_loc(10)
KeyError: 10

Duplicate handling:

In [5]: pd.Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

A boolean array is returned if the matches are non-contiguous:

In [6]: pd.Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)

It uses a hashtable internally, so it is fast:

In [7]: s = pd.Series(np.random.randint(0,10,10000))  # assumes import numpy as np

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop

In [12]: i = pd.Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop

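As Viktor points out, there is a one-time creation overhead to creating an Index (it is incurred when you actually do something with the index, e.g. check is_unique).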
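Because get_loc can hand back an integer, a slice, or a boolean array depending on how the value is distributed, calling code usually has to normalize the result. The following is a rough sketch of one way to do that; `positions_of` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def positions_of(values, target):
    """Return a list of integer positions where target occurs in values."""
    try:
        loc = pd.Index(values).get_loc(target)
    except KeyError:
        return []                                   # value not present
    if isinstance(loc, slice):                      # contiguous duplicates
        return list(range(*loc.indices(len(values))))
    if isinstance(loc, np.ndarray):                 # scattered duplicates: boolean mask
        return [int(i) for i in np.flatnonzero(loc)]
    return [int(loc)]                               # unique match: a single integer

print(positions_of([1, 4, 0, 7, 5], 7))         # [3]
print(positions_of([1, 1, 2, 2, 3, 4], 2))      # [2, 3]
print(positions_of([1, 1, 2, 1, 3, 2, 4], 2))   # [2, 5]
print(positions_of([1, 4, 0, 7, 5], 10))        # []
```

These are positions, not index labels; for a Series `s` with a non-default index, `s.index[positions]` would translate them to labels.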

Another way to do this is:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

This returns 3.

Timing tests on a dataset I'm currently working with (consider it random):

This works if you know in advance that 7 is there. You can check this with (myseries == 7).any().
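A minimal sketch of that membership check, assuming the lookup is done with argmax on a boolean mask (the comments further down mention argmax in this context). The helper name `first_position` is hypothetical.

```python
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])

def first_position(series, value):
    """Return the integer position of the first match, or None if absent."""
    mask = (series == value).to_numpy()
    # argmax on an all-False mask returns 0, so check any() first.
    return int(mask.argmax()) if mask.any() else None

print(first_position(myseries, 7))                # 3
print(first_position(myseries, 10))               # None
print(int((myseries == 10).to_numpy().argmax()))  # 0, misleading without the any() check
```

The sketch calls argmax on the NumPy array rather than on the Series so that the result is a position regardless of pandas version.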

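Another approach, very similar to the first answer, also accounts for multiple 7s (or none): take the index of the boolean-mask selection and turn it into a list.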
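A short sketch of what that can look like, using a made-up series with two 7s and string labels so that labels and positions cannot be confused:

```python
import pandas as pd

# Illustrative data: the value 7 appears twice.
myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])

labels = list(myseries[myseries == 7].index)
print(labels)                                # ['b', 'd']

# With no match the result is simply an empty list, not an error.
print(list(myseries[myseries == 9].index))   # []
```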

If you use numpy, you can get an array of the positions at which your value is found:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

This returns a one-element tuple containing an array of the positions where 7 is the value in myseries:

(array([3], dtype=int64),)
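Note that np.where reports integer positions, which only coincide with the index labels when the Series has the default 0..n-1 index. A small sketch of mapping positions back to labels, using an illustrative string index:

```python
import numpy as np
import pandas as pd

# Same values as above, but with string labels (chosen for illustration).
myseries = pd.Series([1, 4, 0, 7, 5], index=list("abcde"))

positions = np.where(myseries.to_numpy() == 7)[0]  # integer positions
labels = myseries.index[positions]                 # corresponding index labels

print(positions.tolist())  # [3]
print(labels.tolist())     # ['d']
```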

You can use Series.idxmax().

Often your value occurs at multiple indices:

>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')

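I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index can contain any values and you want the index value corresponding to a search value located towards the end of the series.

Here are the speed tests on a 2013 MacBook Pro, using Python 3.7 and pandas 0.25.3: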

In [1]: import pandas as pd                                                

In [2]: import numpy as np                                                 

In [3]: data = [406400, 203200, 101600,  76100,  50800,  25400,  19050,  12700, 
   ...:          9500,   6700,   4750,   3350,   2360,   1700,   1180,    850, 
   ...:           600,    425,    300,    212,    150,    106,     75,     53, 
   ...:            38]                                                                               

In [4]: myseries = pd.Series(data, index=range(1,26))                                                

In [5]: myseries[21]                                                                                 
Out[5]: 150

In [7]: %timeit myseries[myseries == 150].index[0]                                                   
416 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit myseries[myseries == 150].first_valid_index()                                        
585 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit myseries.where(myseries == 150).first_valid_index()                                  
652 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]                                     
195 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit pd.Series(myseries.index, index=myseries)[150]                 
178 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]                                    
77.4 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit myseries.index[list(myseries).index(150)]
12.7 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit myseries.index[myseries.tolist().index(150)]                   
9.46 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

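@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.

Correction: Sorry, I missed one. @Alex Spangher's solution using the list index method is by far the fastest.

Update: Added @EliadL's answer.

Hope this helps.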
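The get_loc timing above includes building the Index on every call. Since that cost is a one-time overhead, it can be informative to time the two cases separately. This is only a sketch using the standard timeit module with stand-in data (not the dataset above); the numbers depend entirely on the machine and pandas version.

```python
import timeit
import pandas as pd

# Hypothetical stand-in data: 25 values with index labels 1..25.
myseries = pd.Series(range(100, 125), index=range(1, 26))
target = 120  # sits at position 20, i.e. index label 21

n = 1000

# Case 1: rebuild the Index on every lookup.
t_build = timeit.timeit(
    lambda: myseries.index[pd.Index(myseries).get_loc(target)], number=n)

# Case 2: build the Index once, then time only the lookup.
idx = pd.Index(myseries)
t_lookup = timeit.timeit(
    lambda: myseries.index[idx.get_loc(target)], number=n)

print(f"build + lookup: {t_build / n * 1e6:.1f} us per call")
print(f"lookup only:    {t_lookup / n * 1e6:.1f} us per call")
```

The prebuilt Index is only valid as long as the values of the Series do not change; after a change it has to be rebuilt.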

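It's amazing that such a simple operation requires such convoluted solutions, and that many of them are so slow. In some cases it takes over half a millisecond to find a value in a series of 25.

Another way to do it that hasn't been mentioned yet is the tolist method: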

myseries.tolist().index(7)

This should return the correct index, assuming the value exists in the Series.

This is the most native and scalable approach I could find:
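Two caveats with tolist().index(): it raises ValueError when the value is missing, and it returns a position rather than an index label. A minimal sketch that handles both, with a hypothetical helper `find_label` and a datetime index chosen for illustration:

```python
import pandas as pd

def find_label(series, value, default=None):
    """Return the index label of the first occurrence of value, or default."""
    try:
        pos = series.tolist().index(value)  # integer position in the values
    except ValueError:                      # list.index raises if value is absent
        return default
    return series.index[pos]                # translate position to label

dates = pd.date_range("2020-01-01", periods=5)
s = pd.Series([1, 4, 0, 7, 5], index=dates)

print(find_label(s, 7))   # 2020-01-04 00:00:00
print(find_label(s, 10))  # None
```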

>>> myindex = pd.Series(myseries.index, index=myseries)

>>> myindex[7]
3

>>> myindex[[7, 5, 7]]
7    3
5    4
7    3
dtype: int64
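One thing to keep in mind with this reverse-lookup Series is what happens with repeated or missing values. A small sketch with illustrative data, using .loc so the lookup is unambiguously label-based:

```python
import pandas as pd

# Illustrative data: the value 7 appears twice, labels are strings.
myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])
myindex = pd.Series(myseries.index, index=myseries)

print(myindex.loc[5])           # e  (unique value: a single label)
print(myindex.loc[7].tolist())  # ['b', 'd']  (repeated value: a Series of labels)

# A missing value raises KeyError, so test membership first if unsure.
print(7 in myindex.index)       # True
print(10 in myindex.index)      # False
```

As with the Index-based approach, myindex only needs to be built once if the original Series does not change between lookups.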

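@Jeff, if you have a more interesting index it's not quite so easy... but I guess you can just do s.index[_].

This seems to only return the index where the max element is found, not the specific index of a certain element as the question asked.

The problem here is that it assumes the element being searched for is actually in the list. It's a bummer that pandas doesn't seem to have a built-in find operation.

This solution only works if your series has a sequential integer index. If your series is indexed by datetime, this doesn't work.

The point about knowing in advance that 7 is an element is right. However, using an any check is not ideal, since a double iteration is needed. There is a neat after-the-fact check that reveals the all-False condition.

Careful: if no element matches the condition, argmax will still return 0 (instead of raising an error).

@Alex Spangher suggested something similar on 17 Sep 2014. See his answer. I have now added both versions to the test results.

Thanks. But shouldn't you be measuring after myindex has been created, since it only needs to be created once?

You could argue that, but it depends on how many lookups like this are required. It is only worth creating the myindex series if you are going to do the lookup many times. For this test I assumed it was only needed once and that the total execution time mattered.

Just ran into the need for this tonight, and using .get_loc() on the same Index object across multiple lookups seems like it should be the fastest. I think an improvement to the answer would be to provide both timings: one including the Index creation, and another for only the lookup after it has been created.

Yes, good point. @EliadL said that too. It depends on how long the series stays static. If any value in the series changes, you need to rebuild pd.Index(myseries). To be fair to the other methods, I assumed the original series might have changed since the last lookup.

This is the best solution I found.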
Series.idxmax() example (it returns 3 here because 7 is the maximum of the series):

>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
