Python: a better way to split a sequence into overlapping chunks?


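I need a function that splits an iterable into chunks, with the option of having the chunks overlap.

I wrote the following code. It gives me the correct output, but it is quite inefficient (slow), and I can't figure out how to speed it up. Is there a better way?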

def split_overlap(seq, size, overlap):
    '''(seq,int,int) => [[...],[...],...]
    Split a sequence into chunks of a specific size and overlap.
    Works also on strings! 

    Examples:
        >>> split_overlap(seq=list(range(10)),size=3,overlap=2)
        [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7], [6, 7, 8], [7, 8, 9]]

        >>> split_overlap(seq=range(10),size=3,overlap=2)
        [range(0, 3), range(1, 4), range(2, 5), range(3, 6), range(4, 7), range(5, 8), range(6, 9), range(7, 10)]

        >>> split_overlap(seq=list(range(10)),size=7,overlap=2)
        [[0, 1, 2, 3, 4, 5, 6], [5, 6, 7, 8, 9]]
    '''
    if size < 1 or overlap < 0:
        raise ValueError('"size" must be an integer with >= 1 while "overlap" must be >= 0')
    result = []
    while True:
        if len(seq) <= size:
            result.append(seq)
            return result
        else:
            result.append(seq[:size])
            seq = seq[size-overlap:]
Results for the suggested solutions, first on a short list and then with a longer list as input:

l = list(range(10))
s = 4
o = 2
print(split_overlap(l,s,o))
print(list(split_overlap_jdehesa(l,s,o)))
print(list(nwise_overlap(l,s,o)))
print(list(split_overlap_Moinuddin(l,s,o)))
print(list(gen_split_overlap(l,s,o)))
print(list(itr_split_overlap(l,s,o)))

[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9), (8, 9, None, None)] #wrong
[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9], [8, 9]] #wrong
[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]

%%timeit
split_overlap(l,7,2)
718 ns ± 2.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
list(split_overlap_jdehesa(l,7,2))
4.02 µs ± 64.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(nwise_overlap(l,7,2))
5.05 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(split_overlap_Moinuddin(l,7,2))
3.89 µs ± 78.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(gen_split_overlap(l,7,2))
1.22 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
list(itr_split_overlap(l,7,2))
3.41 µs ± 36.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
l = list(range(100000))

%%timeit
split_overlap(l,7,2)
4.27 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
list(split_overlap_jdehesa(l,7,2))
31.1 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
list(nwise_overlap(l,7,2))
5.74 ms ± 66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(split_overlap_Moinuddin(l,7,2))
16.9 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(gen_split_overlap(l,7,2))
4.54 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(itr_split_overlap(l,7,2))
19.1 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
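
The jump from 718 ns to 4.27 s for split_overlap is consistent with quadratic behaviour: every pass through the loop rebinds seq to seq[size-overlap:], and slicing a list copies the whole remaining tail. The sketch below only counts how many elements that loop would copy; count_copied_elements is a hypothetical helper written for this illustration, not part of any answer.

```python
def count_copied_elements(n, size, overlap):
    """Count the list elements copied by slicing in the naive loop (illustration only)."""
    step = size - overlap
    remaining = n
    copied = 0
    while remaining > size:
        copied += size        # result.append(seq[:size]) copies `size` elements
        remaining -= step
        copied += remaining   # seq = seq[size-overlap:] copies the whole tail
    return copied


for n in (1000, 10000, 100000):
    print(n, count_copied_elements(n, 7, 2))
```

Multiplying the input length by ten multiplies the number of copied elements by roughly a hundred, which is why the index-based and generator-based answers scale so much better.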

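From other tests (not reported here), it turned out that for small lists len(list) ...

Your approach is about as good as it gets: you need to poll the sequence/iterable and build the chunks. In any case, here is a lazy version that works with iterables and uses a deque for performance: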

from collections import deque

def split_overlap(iterable, size, overlap=0):
    size = int(size)
    overlap = int(overlap)
    if size < 1 or overlap < 0 or overlap >= size:
        raise ValueError()
    pops = size - overlap
    q = deque(maxlen=size)
    for elem in iterable:
        q.append(elem)
        if len(q) == size:
            yield tuple(q)
            for _ in range(pops):
                q.popleft()
    # Yield final incomplete tuple if necessary
    if len(q) > overlap:
        yield tuple(q)

>>> list(split_overlap(range(10), 4, 2))
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]
>>> list(split_overlap(range(10), 5, 2))
[(0, 1, 2, 3, 4), (3, 4, 5, 6, 7), (6, 7, 8, 9)]
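Note: as written, the generator yields one last incomplete tuple if the input does not divide into an exact number of chunks (see the second example). If you want to avoid this, remove the final
if len(q) > overlap: yield tuple(q)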
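
Instead of deleting those two lines, the choice could be exposed as a parameter. The following is a minimal sketch of that idea; the name split_overlap_deque and the keep_partial flag are assumptions introduced here for illustration and are not part of the answer's code.

```python
from collections import deque


def split_overlap_deque(iterable, size, overlap=0, keep_partial=True):
    """Lazily yield overlapping tuples; keep_partial is a hypothetical switch."""
    if size < 1 or overlap < 0 or overlap >= size:
        raise ValueError('need size >= 1 and 0 <= overlap < size')
    pops = size - overlap
    q = deque(maxlen=size)
    for elem in iterable:
        q.append(elem)
        if len(q) == size:
            yield tuple(q)
            # Drop the elements that are not shared with the next chunk.
            for _ in range(pops):
                q.popleft()
    # Emit the trailing incomplete chunk only when asked to, and only if it
    # holds more than the overlap carried over from the previous chunk.
    if keep_partial and len(q) > overlap:
        yield tuple(q)


print(list(split_overlap_deque(range(10), 5, 2)))
print(list(split_overlap_deque(range(10), 5, 2, keep_partial=False)))
```

With keep_partial=False the trailing (6, 7, 8, 9) from the second example is dropped and only full-size chunks remain.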

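You can try using

itertools.izip(...)
This is good for large lists, because it returns an iterator rather than a list. (izip only exists in Python 2; in Python 3 the built-in zip is already lazy.)

Like this: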

import itertools

# itertools.izip only exists in Python 2; in Python 3 the built-in zip is already lazy.
izip = getattr(itertools, 'izip', zip)

def split_overlap(iterable, size, overlap):
    '''(seq,int,int) => [(...),(...),...]
    Split a sequence into chunks of a specific size and overlap.
    Works also on strings!
    A trailing chunk shorter than "size" is dropped.

    Examples:
        >>> split_overlap(iterable=list(range(10)),size=3,overlap=2)
        [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]

        >>> split_overlap(iterable=range(10),size=3,overlap=2)
        [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]
    '''
    if size < 1 or overlap < 0:
        raise ValueError('"size" must be an integer >= 1 while "overlap" must be >= 0')
    result = []
    # Slice i holds the i-th element of every chunk; zipping the slices rebuilds the chunks.
    for chunk in izip(*[iterable[i::size-overlap] for i in range(size)]):
        result.append(chunk)
    return result
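If the chunk size criterion must be met (discarding any remaining chunk at the end that does not meet it)

You can create a custom function using zip and a list comprehension to achieve this: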

def split_overlap(seq, size, overlap):
     return [x for x in zip(*[seq[i::size-overlap] for i in range(size)])]
Sample run:

# Chunk size: 3
# Overlap: 2 
>>> split_overlap(list(range(10)), 3, 2)
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]

# Chunk size: 3
# Overlap: 1
>>> split_overlap(list(range(10)), 3, 1)
[(0, 1, 2), (2, 3, 4), (4, 5, 6), (6, 7, 8)]

# Chunk size: 4
# Overlap: 1
>>> split_overlap(list(range(10)), 4, 1)
[(0, 1, 2, 3), (3, 4, 5, 6), (6, 7, 8, 9)]

# Chunk size: 4
# Overlap: 2
>>> split_overlap(list(range(10)), 4, 2)
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]

# Chunk size: 4
# Overlap: 3
>>> split_overlap(list(range(10)), 4, 3)
[(0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6), (4, 5, 6, 7), (5, 6, 7, 8), (6, 7, 8, 9)]
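
To see why the zip-based one-liner works, it can help to look at the intermediate slices. The sketch below uses illustrative variable names (step, columns, chunks) that do not appear in the answer: slice number i collects the i-th element of every chunk, and zip stitches those columns back together, stopping at the shortest one.

```python
seq = list(range(10))
size, overlap = 4, 2
step = size - overlap  # distance between the starts of consecutive chunks

# columns[i] holds the i-th element of every chunk.
columns = [seq[i::step] for i in range(size)]
for column in columns:
    print(column)

# zip stops at the shortest column, which is what drops an incomplete final chunk.
chunks = list(zip(*columns))
print(chunks)
```

Here the last two columns have only four entries, so zip produces four chunks and the leftover 8 and 9 from the first two columns are discarded.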
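If you also need the remaining chunk at the end that does not meet the chunk size criterion

If you want to keep chunks even when they do not reach the required chunk size, you should use itertools.zip_longest in Python 3.x (the equivalent of itertools.izip_longest in Python 2.x).

Also, this is a variant that yields the values dynamically, which is more memory-efficient when you have a huge list: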

try:
    # Python 3.x
    from itertools import zip_longest as iterzip
except ImportError:
    # Python 2.x
    from itertools import izip_longest as iterzip

# Generator function
def split_overlap(seq, size, overlap):
    for x in iterzip(*[seq[i::size-overlap] for i in range(size)]):
        # Strip the None padding from a short final chunk.
        yield tuple(i for i in x if i is not None) if x[-1] is None else x
        #      assuming that your initial list does not contain `None`;
        #      the use of `iterzip` is based on the same assumption
Sample run:

#     v  type-cast to list in order to display the result,
#     v  not required during iterations
>>> list(split_overlap(list(range(10)),7,2))
[(0, 1, 2, 3, 4, 5, 6), (5, 6, 7, 8, 9)]
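
The assumption that the input contains no None can be lifted by padding with a private sentinel object instead, using the fillvalue argument of zip_longest. This is a sketch of that swapped-in technique for Python 3, not the answer's own code; the names _FILL and split_overlap_padded are made up for the example.

```python
from itertools import zip_longest

_FILL = object()  # private sentinel that cannot occur in user data


def split_overlap_padded(seq, size, overlap):
    """Like the zip_longest variant, but safe for sequences containing None."""
    step = size - overlap
    columns = [seq[i::step] for i in range(size)]
    for chunk in zip_longest(*columns, fillvalue=_FILL):
        if chunk[-1] is _FILL:
            # Short final chunk: strip the padding.
            chunk = tuple(x for x in chunk if x is not _FILL)
        yield chunk


print(list(split_overlap_padded(list(range(10)), 7, 2)))
print(list(split_overlap_padded([None, 1, None, 3, None], 3, 1)))
```

Like the original variant, this can emit a trailing chunk that consists only of overlap elements, which is the behaviour flagged as #wrong in the question's comparison.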

Sometimes readability counts more than speed. A simple generator that iterates over indices and yields slices gets the job done in reasonable time:

def gen_split_overlap(seq, size, overlap):        
    if size < 1 or overlap < 0:
        raise ValueError('size must be >= 1 and overlap >= 0')

    for i in range(0, len(seq) - overlap, size - overlap):            
        yield seq[i:i + size]
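
Because the generator only relies on len() and slicing, it should work on any sequence type and return chunks of that same type. The sketch below repeats the function so that it runs on its own and adds one extra check, overlap < size, which is an assumption of this example rather than part of the answer (without it, range() would receive a zero or negative step).

```python
def gen_split_overlap(seq, size, overlap):
    # The overlap < size check is an addition made for this sketch.
    if size < 1 or overlap < 0 or overlap >= size:
        raise ValueError('size must be >= 1 and 0 <= overlap < size')

    # Step through the start index of each chunk and slice it out.
    for i in range(0, len(seq) - overlap, size - overlap):
        yield seq[i:i + size]


print(list(gen_split_overlap(list(range(10)), 7, 2)))  # lists in, lists out
print(list(gen_split_overlap('abcdefghij', 4, 2)))     # strings in, strings out
print(list(gen_split_overlap(range(10), 3, 2))[:2])    # ranges in, ranges out
```

Slicing by index avoids rebuilding the remaining tail on every step, which is the costly part of the loop in the question.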

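Do you really need to build and return a list of lists? The pattern in itertools is to return an iterator of tuples, evaluated lazily. Anyway, if this is working code that you think could be improved, perhaps see …

An iterator is fine too.

Actually, it doesn't quite work, at least not as documented, since it only works on sequences, not iterables.

range() is an Iterable.

What do you mean when you say "quite inefficient"?

I like this! It doesn't return the expected result when the split is uneven, though: list(split_overlap([0]*10, 7, 2)) == [(0, 0, 0, 0, 0, 0, 0)] != [(0, 0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0)]

@JamesSchinner Correct, because in this case the next chunk does not satisfy the prerequisite of size 7. After taking the first chunk, 3 elements remain; with the allowed overlap of 2, 5 elements are eligible for the second chunk. But since the required chunk size is 7, I did not include it.

The OP made the same comment on my answer. It's an easy fix though: zip_longest.

@jamesschiner In that case the OP's comment is wrong. According to the requirements stated in the question, this is the expected behaviour of the program.

itr_split_overlap() does not return the correct result.

I was so deep into zip that my brain didn't think of doing direct slicing (+1).

Your code does not produce the correct output. Try print(split_overlap(list(range(10)), 7, 2))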
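
On the comment above about sequences versus iterables: slicing-based versions need len() and indexing, so they fail on generators. One possible way to keep the question's chunk semantics (the last chunk holds whatever is left) while accepting any iterable is itertools.islice. This is a minimal sketch under that assumption; split_overlap_iter is a hypothetical name and does not come from the thread.

```python
from itertools import islice


def split_overlap_iter(iterable, size, overlap):
    """Yield overlapping tuples from any iterable, including generators."""
    if size < 1 or overlap < 0 or overlap >= size:
        raise ValueError('need size >= 1 and 0 <= overlap < size')
    step = size - overlap
    it = iter(iterable)
    chunk = tuple(islice(it, size))
    yield chunk
    while True:
        new = tuple(islice(it, step))
        if not new:
            return  # input exhausted, nothing beyond the overlap is left
        # Keep the overlapping tail of the previous chunk and append the new items.
        chunk = chunk[step:] + new
        yield chunk


print(list(split_overlap_iter((x for x in range(10)), 7, 2)))
print(list(split_overlap_iter(iter(range(10)), 4, 2)))
```

Only one chunk is held in memory at a time, so this should also suit inputs that are too large to materialize as a list.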