Python: indexes of matching elements given two lists, where one list has redundant entries

python list indexing

I have two lists, a and b. List a contains the elements whose matching indexes in b I want to find. Unlike a, every element in b is unique.

a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]

Using the solution from another source, the match is just

[27,28,29,30,32,37,39]

but I want it to be

[27,27,28,29,30,30,32,37,39]

How can I do that?

>>> a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
>>> b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]
>>> [b.index(x) for x in a]
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]

print([b.index(i) for i in a if i in b])

If you have large lists, making b a set for the membership test will be more efficient:

st = set(b)
print([b.index(x) for x in a if x in st])

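When your data is sorted, and assuming all elements of list a are in list b, you can also use bisect_left from the bisect module, so each index lookup is O(log n).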
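A minimal sketch of the bisect approach, assuming b is sorted in ascending order and contains no duplicates. Note that bisect_left returns an insertion point rather than raising an error when the value is absent, so the hypothetical helper sorted_index below adds an explicit check; that check is an illustration and not part of the original answer.

```python
from bisect import bisect_left

a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = list(range(1966, 2015))  # sorted, unique years 1966..2014


def sorted_index(seq, x):
    """Return the index of x in the sorted list seq via binary search."""
    i = bisect_left(seq, x)
    # bisect_left only gives the insertion point, so verify the match.
    if i == len(seq) or seq[i] != x:
        raise ValueError(f"{x!r} is not in the list")
    return i


result = [sorted_index(b, x) for x in a]
print(result)  # [27, 27, 28, 29, 30, 30, 32, 37, 39, 39]
```

If every element of a is known to be in b, the plain [bisect_left(b, x) for x in a] form is enough and skips the extra comparison.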

On a small dataset, it runs about twice as fast as b.index:

In [22]: timeit [bisect_left(b, x) for x in a]
100000 loops, best of 3: 4.2 µs per loop

In [23]: timeit [b.index(x) for x in a]
100000 loops, best of 3: 8.84 µs per loop

Another option is to use a dict to store the indexes, which means the code runs in linear time: one pass over a and one pass over b:

# store all indexes as values and years as keys
indexes = {k: i for i, k in enumerate(b)}
# one pass over a accessing each index in constant time
print([indexes[x] for x in a])
[27, 27, 28, 29, 30, 30, 32, 37, 39, 39]

Even on small inputs it is more efficient than b.index, and the advantage grows as a gets larger:

In [34]: %%timeit                                                            
indexes = {k: i for i, k in enumerate(b)}
[indexes[x] for x in a]
   ....: 
100000 loops, best of 3: 7.54 µs per loop

In [39]: b = list(range(1966,2100))
In [40]: samp = list(range(1966,2100))
In [41]: a = [choice(samp) for _ in range(100)]

In [42]: timeit [b.index(x) for x in a]
10000 loops, best of 3: 154 µs per loop   
In [43]: %%timeit                      
indexes = {k: i for i, k in enumerate(b)}
[indexes[x] for x in a]
   ....: 
10000 loops, best of 3: 22.5 µs per loop

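This is an extension of Padraic Cunningham's answer. If you instead convert the list being indexed into a dictionary, you get O(1) lookups in exchange for O(n) preprocessing: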

a = [1993, 1993, 1994, 1995, 1996, 1996, 1998, 2003, 2005, 2005]
b = [1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]

d = {value: index for index, value in enumerate(b)}
print([d[x] for x in a])



>>> timeit("[bisect_left(b, x) for x in a]", "from __main__ import a, b; from bisect import bisect_left")
3.513558427607279
>>> timeit("[b.index(x) for x in a]", "from __main__ import a, b")
8.010070997323822
>>> timeit("d = {value: index for index, value in enumerate(b)}; [d[x] for x in a]", "from __main__ import a, b")
5.5277420695707065
>>> timeit("[d[x] for x in a]", "from __main__ import a, b; d = {value: index for index, value in enumerate(b)}")
1.1214096146165389

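So, leaving preprocessing aside, the dictionary lookup d[x] is roughly 7 times as fast as b.index in the actual processing (about 1.12 s versus 8.01 s in the timings above). That pays off even more if you have many lists a and few lists b. If you only do it once, bisect_left is faster still (about 3.51 s), provided you can guarantee that b is monotonically increasing.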
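To illustrate the point about spreading the preprocessing cost over many lists a and a single b, here is a small sketch that builds the dictionary once and reuses it. The names build_index and queries, and the query lists themselves, are hypothetical and only for illustration.

```python
def build_index(seq):
    """Map each value of seq to its position (assumes unique, hashable values)."""
    return {value: index for index, value in enumerate(seq)}


b = list(range(1966, 2015))  # years 1966..2014
d = build_index(b)  # O(len(b)) preprocessing, done once

# Hypothetical query lists; each lookup below is O(1) on average.
queries = [
    [1993, 1993, 1994],
    [2005, 2005],
    [1966, 2014],
]
results = [[d[x] for x in a] for a in queries]
print(results)  # [[27, 27, 28], [39, 39], [0, 48]]
```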

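This works in this case, but it will raise a ValueError if any value in a is not in b. [b.index(x) for x in a if b.count(x)] would be safer... although of course it all depends on whether you want it to fail.

Now you just do the lookup twice, once in the set and once in the list. That is only an improvement when the value is not in the list.

@LennartRegebro. So O(1) lookups are slower than linear lookups because not many values are missing from b? Do you think using the list would be better if b had 1000000 elements?

No, but if it is in the list, and the OP's question assumes it is, then you don't need the if test at all. If it is mostly in the list, then catching the error is better.

@LennartRegebro. I added a bisect approach, which is more efficient than indexing.

True, if we know b is ordered, that will speed it up.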
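The trade-off discussed in these comments can be sketched as follows, assuming the goal is either to skip a missing value or to fail on it. The value 1900 is a made-up entry that does not occur in b.

```python
b = list(range(1966, 2015))  # years 1966..2014
a = [1993, 1900, 1994]  # 1900 is deliberately absent from b

indexes = {value: index for index, value in enumerate(b)}

# Option 1: skip missing values; both the membership test and the
# lookup are O(1) on average, so b is never scanned linearly.
skipped = [indexes[x] for x in a if x in indexes]
print(skipped)  # [27, 28]

# Option 2: let the failure surface and handle it explicitly.
try:
    strict = [b.index(x) for x in a]
except ValueError as exc:
    strict = None
    print("lookup failed:", exc)
```

Which option fits depends on whether a missing value should be treated as an error, as the first comment points out.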