Python: get the start and end indices of the longest repeated sequence in a list
I'm putting something together and want to show a team's longest winning streak, along with the start and end dates of that streak. For example, if I have the following two lists:
streak = ["W", "W", "W", "L","W", "W", "W", "W", "W", "L"]
dates = ["2016-06-15", "2016-06-14", "2016-06-13", "2016-06-10", "2016-06-09", "2016-06-08", "2016-06-05", "2016-06-03", "2016-06-02", "2016-06-02"]
Then, if I want to get the longest streak, I can do something like this:
from itertools import groupby
longest = sorted([list(y) for x, y in groupby(streak)], key = len)[-1]
print(longest)
# ['W', 'W', 'W', 'W', 'W']
Now my idea is (let me know if this can be done better) to somehow get the start and end indices of that longest streak, so in this case:
start, end = get_indices(streak, longest) # 8, 4
print("The longest streak of {}{} was from {} to {}.".format(len(longest), longest[0], dates[start], dates[end]))
# The longest streak of 5W was from 2016-06-02 to 2016-06-09.
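How can I do this? Or is there a better way to achieve it, for example by zipping the lists together and doing something with that?

Given your code, you can keep going with itertools and use the underdog takewhile: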
from itertools import takewhile, groupby
import itertools
L = [list(y) for x, y in groupby(streak)]
l = sorted(L, key=len)[-1]
ix = len(list(itertools.chain.from_iterable(takewhile(lambda x: x!=l, L))))
print("the longest streak goes from " + dates[ix+len(l)] + " to " + dates[ix])
#the longest streak goes from 2016-06-02 to 2016-06-09
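The question also asks whether zipping the two lists together would work. Here is a minimal sketch of that idea, assuming both lists have the same length and that the dates are ordered newest first, as in the sample data. The function name longest_streak_with_dates is made up for illustration and is not part of any answer here.

```python
from itertools import groupby
from operator import itemgetter


def longest_streak_with_dates(streak, dates):
    """Return (result, length, oldest_date, newest_date) for the longest run.

    Hypothetical helper: assumes len(streak) == len(dates) and that dates
    are ordered newest first.
    """
    # Pair every result with its date, then group consecutive pairs by result.
    runs = [list(group) for _, group in groupby(zip(streak, dates), key=itemgetter(0))]
    # max() returns the first run of maximal length if several runs tie.
    best = max(runs, key=len)
    result = best[0][0]
    newest, oldest = best[0][1], best[-1][1]
    return result, len(best), oldest, newest


streak = ["W", "W", "W", "L", "W", "W", "W", "W", "W", "L"]
dates = ["2016-06-15", "2016-06-14", "2016-06-13", "2016-06-10", "2016-06-09",
         "2016-06-08", "2016-06-05", "2016-06-03", "2016-06-02", "2016-06-02"]

result, length, oldest, newest = longest_streak_with_dates(streak, dates)
print("The longest streak of {}{} was from {} to {}.".format(length, result, oldest, newest))
# The longest streak of 5W was from 2016-06-02 to 2016-06-09.
```

Because each result travels with its date, no index arithmetic is needed, at the cost of building a list of pairs for every run.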
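Alternative solutions that reduce temporaries (but note that, unless RAM is severely constrained or the streaks are unreasonably huge, creating the temporaries is faster than these temporary-minimizing alternatives). They are not necessary; they just demonstrate other ways of combining iterator-related tools to get the same result: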
from itertools import groupby, tee, zip_longest
from operator import itemgetter, sub
def longeststreak(streaks, dates):
    # Create parallel iterators over the first index of each new group
    s, e = tee(map(next, map(itemgetter(1), groupby(range(len(streaks)), key=streaks.__getitem__))))
    # Advance end iterator so we can zip at offset to create start/end index pairs
    next(e, None)
    # Find greatest difference between start and end
    longend, longstart = max(zip_longest(e, s, fillvalue=len(streaks)), key=lambda es: sub(*es))
    # return dates for those indices (must subtract one from end since end index is exclusive)
    return dates[longend-1], dates[longstart]
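To make the offset zip easier to follow, this small sketch prints the intermediate (end, start) pairs for the sample data. It is an illustration only: it builds a list of start indices first, which the function above deliberately avoids.

```python
from itertools import groupby, tee, zip_longest

streak = ["W", "W", "W", "L", "W", "W", "W", "W", "W", "L"]

# First index of every run of identical results: [0, 3, 4, 9]
starts = [next(group) for _, group in groupby(range(len(streak)), key=streak.__getitem__)]

# Two iterators over the same start indices; advance one by a single step.
s, e = tee(starts)
next(e, None)

# Each run ends (exclusively) where the next one starts; the last run ends
# at len(streak), which is what fillvalue supplies.
pairs = list(zip_longest(e, s, fillvalue=len(streak)))
print(pairs)  # [(3, 0), (4, 3), (9, 4), (10, 9)]

end, start = max(pairs, key=lambda pair: pair[0] - pair[1])
print(start, end)  # 4 9
```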
Or another approach:
from collections import deque
from itertools import groupby
from operator import itemgetter, sub
def longeststreak(streaks, dates):
    # Generator of grouped indices for each streak
    streakgroups = map(itemgetter(1), groupby(range(len(streaks)), streaks.__getitem__))
    # Get first and last index of each streak without storing intermediate indices
    streakranges = ((next(iter(deque(g, 1)), start), start) for g in streakgroups for start in (next(g),))
    # As before, find greatest difference and return range
    longend, longstart = max(streakranges, key=lambda es: sub(*es))
    # End index is inclusive in this design, so don't subtract 1
    return dates[longend], dates[longstart]
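The deque(g, 1) call above relies on a common idiom: a deque with maxlen=1 consumes an iterator and keeps only its final item. A minimal sketch of that idiom in isolation follows; the helper name last_or_default is hypothetical.

```python
from collections import deque


def last_or_default(iterable, default=None):
    """Return the last item of iterable, or default if it is empty."""
    # A deque bounded to one element discards everything but the final item
    # while it consumes the iterable.
    tail = deque(iterable, maxlen=1)
    return tail[0] if tail else default


print(last_or_default(iter(range(4, 9))))   # 8
print(last_or_default(iter([]), "empty"))   # empty
```

In the answer's code the default is the streak's own start index, which covers streaks of length one, where nothing is left in the group after next(g).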
In both cases, if you are on Py2, you need to import map from future_builtins, and for the former, use izip_longest instead of zip_longest.
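One way to write those imports so that the same file runs on both versions is a try/except shim. This is only a sketch of a common pattern, not something taken from the answer:

```python
try:
    # Python 2: a lazy map, plus the old name of zip_longest
    from future_builtins import map
    from itertools import izip_longest as zip_longest
except ImportError:
    # Python 3: map is already lazy and zip_longest has its final name
    from itertools import zip_longest

print(list(zip_longest([1, 2], [10], fillvalue=0)))  # [(1, 10), (2, 0)]
```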
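Also, for completeness, here is an optimized version of the Colonel's answer (the takewhile approach above), designed to minimize bytecode execution (slow in CPython) in favor of more C-level execution (fast in CPython):
from itertools import chain, groupby, takewhile
from operator import itemgetter
def longeststreak(streaks, dates):
    # Use map with C-level builtins to reduce bytecode use
    streakgroups = list(map(list, map(itemgetter(1), groupby(streaks))))
    # Use max with key instead of sorted followed by indexing at -1, to turn
    # O(n log n) work into O(n) work
    longeststreak = max(streakgroups, key=len)
    # Replace lambda with C layer built-in comparator
    ix = len(list(chain.from_iterable(takewhile(longeststreak.__ne__, streakgroups))))
    # Added the -1 missing in the original answer: ix + len(longeststreak) is
    # one past the streak's last element. Not noticeable on the sample data,
    # which has the same date at the end of the longest streak and the start
    # of the next
    return dates[ix+len(longeststreak)-1], dates[ix]
For the record, because of the various overheads involved in trying to avoid creating lists/tuples that contain the whole streaks when we only need the start and end, my two alternative solutions run slower on essentially all real-world data. On my machine, in ipython3 (Python 3.5 x86-64 for Linux), a test case with random streak lengths totalling 450K entries takes about 35 ms to process with the optimized version of the Colonel's answer, about 50 ms with my first solution (the one using tee), and about 77 ms with my second solution (the one using deque).

Just to mention: your dates are "reversed", running from the end of the list to the start. I took that into account in my answer.

1. You need to use dates[ix+len(l)-1] (note the added -1), or you overshoot the end date (you don't notice this on the sample data because it repeats the same date at the end of one streak and the start of the next, but the result could be wrong if the dates are unique). 2. Naming variables L, and especially l, is downright hostile to maintainers: depending on the font, it is hard to tell the name apart from the integer constant 1, the name I (capital i), or the bitwise-or operator |.

Curious whether you know a way to find both the longest "W" streak and the longest "L" streak? They might be the same length, and I'd like to check.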
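The follow-up comment asking for the longest "W" streak and the longest "L" streak separately has no answer on this page. One possible sketch, assuming only that the results come as a flat list like streak; the function name longest_streak_per_result is hypothetical:

```python
from itertools import groupby


def longest_streak_per_result(streak):
    """Map each result to (length, start_index, end_index) of its longest run.

    Indices are inclusive positions in streak.
    """
    best = {}
    index = 0
    for result, group in groupby(streak):
        length = sum(1 for _ in group)
        # Strict > keeps the earliest run when two runs of a result tie.
        if length > best.get(result, (0, 0, 0))[0]:
            best[result] = (length, index, index + length - 1)
        index += length
    return best


streak = ["W", "W", "W", "L", "W", "W", "W", "W", "W", "L"]
best = longest_streak_per_result(streak)
print(best["W"])  # (5, 4, 8)
print(best["L"])  # (1, 3, 3)
```

Comparing best["W"][0] with best["L"][0] then shows whether the two longest streaks have the same length, and the stored indices can be used to look up the matching entries in dates.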