Unexpected behavior of Python itertools.groupby
This is the observed behavior:
In [4]: x = itertools.groupby(range(10), lambda x: True)
In [5]: y = next(x)
In [6]: next(x)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-6-5e4e57af3a97> in <module>()
----> 1 next(x)
StopIteration:
In [7]: y
Out[7]: (True, <itertools._grouper at 0x10a672e80>)
In [8]: list(y[1])
Out[8]: [9]
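The problem is that your key function puts every element into a single group, so after the first call to next() everything has already been grouped: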
import itertools
x = itertools.groupby(range(10), lambda x: True)
key, elements = next(x)
But elements is a generator, so you need to pass it into some structure right away in order to print or save it, for example a list:
print('Key: "{}" with value "{}"'.format(key, list(elements)))
At that point range(10) is exhausted and the groupby generator is finished:
Key: "True" with value "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]"
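As a minimal sketch of the same advice with more than one group, each group can be copied into a list while it is still the current one. The key function x // 5 and the variable names are assumptions chosen for illustration, not part of the question:

```python
import itertools

# Hypothetical key: split 0..9 into two runs (0-4 and 5-9).
groups = []
for key, elements in itertools.groupby(range(10), lambda x: x // 5):
    # Copy the group into a list before asking groupby for the next one.
    groups.append((key, list(elements)))

print(groups)
```

Because every group is materialized before the loop advances, nothing depends on what happens to a stale group iterator.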
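The documentation states:
itertools.groupby(iterable, key=None)
[...]
The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL's GROUP BY, which aggregates common elements regardless of their input order.
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.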
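The uniq-like behavior described in the quoted passage can be illustrated with a short sketch. The sample string is the one used in the comments of the documentation's equivalent code; sorting it first is shown only as a contrast with SQL-style grouping, and the variable names are illustrative:

```python
from itertools import groupby

data = "AAAABBBCCDAABBB"

# Unsorted input: a new group starts every time the key changes,
# so 'A' and 'B' each show up twice.
run_keys = [key for key, _ in groupby(data)]

# Sorted input: equal elements are adjacent, so each key appears once.
sorted_counts = [(key, len(list(group))) for key, group in groupby(sorted(data))]

print(run_keys)
print(sorted_counts)
```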
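So the assumption from the last paragraph would be that the resulting list is the empty list [], since the iterator has already been advanced and has hit StopIteration; in CPython, however, the result is surprising.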
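This is because the group iterator lags one item behind the original iterator. groupby needs to peek one item ahead to see whether it belongs to the current group or the next one, yet it must be able to yield that item later as the first item of the new group.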
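The one-item lookahead can be observed from outside by keeping a reference to the underlying iterator. This is only a sketch; the key function and the variable names are assumptions made for illustration:

```python
import itertools

source = iter(range(6))
grouped = itertools.groupby(source, key=lambda x: x < 3)

key, group = next(grouped)
first_group = list(group)      # consumes 0, 1, 2 and peeks at 3

# groupby already pulled 3 out of the shared iterator to detect
# the end of the first group, so the source continues at 4.
after_peek = next(source)

# The peeked item 3 is not lost: it starts the next group.
_, rest = next(grouped)
second_group = list(rest)

print(first_group, after_peek, second_group)
```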
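However, the currkey and currvalue attributes of the groupby object are not reset when the original iterator is exhausted, so currvalue still points to the last item from the iterator.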
The CPython documentation actually contains this equivalent code, which has exactly the same behavior as the C version:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey))
    def _grouper(self, tgtkey):
        while self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)
Notably, __next__ finds the first item of the next group and stores its key in self.currkey and its value in self.currvalue. But the key is the line
self.currvalue = next(self.it) # Exit on StopIteration
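When next throws StopIteration, self.currvalue still contains the last item of the previous group. Now, when y[1] is turned into a list, it first yields the value of self.currvalue, and only then calls next() on the underlying iterator (and hits StopIteration again).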
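To trace this without depending on a particular interpreter release, the documented pure-Python equivalent can be condensed into a small class and run directly. DocsGroupBy is a hypothetical name used only for this sketch; its logic follows the equivalent code quoted from the documentation. The built-in itertools.groupby in other implementations or releases may not reproduce this result.

```python
class DocsGroupBy:
    """Condensed copy of the pure-Python equivalent (hypothetical name)."""

    def __init__(self, iterable, key=None):
        self.keyfunc = key if key is not None else (lambda x: x)
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()

    def __iter__(self):
        return self

    def __next__(self):
        while self.currkey == self.tgtkey:
            # If next() raises StopIteration, currvalue keeps its old value.
            self.currvalue = next(self.it)
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey))

    def _grouper(self, tgtkey):
        while self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)


x = DocsGroupBy(range(10), lambda x: True)
y = next(x)            # group iterator created, but not started yet

try:
    next(x)            # skips 1..9 looking for a new key, then runs dry
    exhausted = False
except StopIteration:
    exhausted = True

leftover = list(y[1])  # yields the stale currvalue before stopping
print(exhausted, leftover)  # True [9]
```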
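Even though the documentation contains a Python equivalent that behaves exactly like the authoritative C implementation in CPython, the IronPython, Jython and PyPy implementations give different results.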