Python: How do I iterate over a file in chunks?
I have a file (foo.txt) whose lines are sorted so that rows sharing the same value in column 0 are grouped together. How can I iterate over the file in chunks keyed on line.split()[0]? I know a generator can do this, but I am not entirely sure how. Basically, I want to do something like this:
def first_column_grouping(file):
    yield some_list  ## How?

with open("foo.txt") as file:
    for group in first_column_grouping(file):  ## 3 values
        print group
Expected output:
["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]
["2 hello goodbye seeya"]
["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]
So what you actually want is the functionality provided by itertools.groupby. This works as long as your first column is sorted:
>>> import io
>>> from itertools import groupby
>>> from operator import itemgetter
>>> # s is assumed to hold the text of foo.txt
>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(g))
...
['1 foo bar\n', '1 lorem ipsum gypsum\n', '1 baba loo too\n']
['2 hello goodbye seeya\n']
['3 kobe magic wilt\n', '3 foo sneaks bar\n', '3 more stuff\n', '3 last line in file']
If you want to clean up the output a little, you can map str.strip over each group:
>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(map(str.strip, g)))
...
['1 foo bar', '1 lorem ipsum gypsum', '1 baba loo too']
['2 hello goodbye seeya']
['3 kobe magic wilt', '3 foo sneaks bar', '3 more stuff', '3 last line in file']
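Note that itemgetter(0) keys on the first character of each line, so it only suits single-character keys. Here is a minimal sketch, assuming whitespace-separated columns, that keys on the whole first column and has the generator shape the question asked for. The sample text is made up for illustration and is not the real foo.txt:

```python
import io
from itertools import groupby


def first_column_grouping(lines):
    """Yield lists of consecutive lines that share the same first column."""
    # Key on the first whitespace-separated field, not the first character,
    # so multi-character keys such as "10" are handled too.
    for _, group in groupby(lines, key=lambda line: line.split()[0]):
        yield [line.rstrip("\n") for line in group]


# Hypothetical sample data standing in for foo.txt.
sample = "1 foo bar\n1 lorem ipsum\n10 hello\n10 goodbye\n3 last line"

with io.StringIO(sample) as f:
    groups = list(first_column_grouping(f))

for group in groups:
    print(group)
```

A blank line in the input would raise IndexError in the key function, so filter those out first if your file may contain them.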
If you want to implement this from scratch, an inflexible, naive generator might look like this:
>>> def groupby_first_column(f):
...     line = next(f)
...     k = line[0]
...     group = [line]
...     for line in f:
...         if line[0] == k:
...             group.append(line)
...         else:
...             yield group
...             group = [line]
...             k = line[0]
...     yield group
...
>>> with io.StringIO(s) as f:
...     for group in groupby_first_column(f):
...         print(list(group))
...
['1 foo bar\n', '1 lorem ipsum gypsum\n', '1 baba loo too\n']
['2 hello goodbye seeya\n']
['3 kobe magic wilt\n', '3 foo sneaks bar\n', '3 more stuff\n', '3 last line in file']
Warning: the generator above only works if the first column of each line sits at the very first position and is exactly one character long. It is not meant to be particularly useful, only to illustrate the idea. If you want to roll your own, you will have to be more thorough.
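As one possible way to be more thorough, the sketch below keys on the whole first whitespace-separated field and copes with empty input. On Python 3.7 and later, a bare next(f) on an empty file raises StopIteration inside the generator, which surfaces as a RuntimeError (PEP 479), so a default is passed to next instead. This is an illustration under those assumptions, not a replacement for itertools.groupby:

```python
import io


def groupby_first_column(lines):
    """Yield lists of adjacent lines sharing the same first field."""
    lines = iter(lines)
    first = next(lines, None)  # a default avoids StopIteration inside the generator
    if first is None:
        return  # empty input: yield nothing
    key = first.split()[0]
    group = [first]
    for line in lines:
        new_key = line.split()[0]
        if new_key == key:
            group.append(line)
        else:
            yield group
            group = [line]
            key = new_key
    yield group


empty_result = list(groupby_first_column(io.StringIO("")))
wide_result = list(groupby_first_column(io.StringIO("10 a\n10 b\n2 c\n")))
print(empty_result)
print(wide_result)
```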
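A variant of this (with file being your file object inside the with statement) iterates over the file without needing to keep the whole file in memory. As a consequence, only adjacent lines are grouped.
(You appear to be using Python 2: file is not a good variable name, because it is a built-in.)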
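To make the "only adjacent lines" point concrete, here is a small sketch with made-up data in which the key 1 appears in two separate runs. Grouping the lines as they come yields that key twice. Sorting first merges the runs, at the cost of holding every line in memory:

```python
from itertools import groupby

lines = ["1 a\n", "2 b\n", "1 c\n"]  # key "1" appears in two separate runs


def first_col(line):
    """Return the first whitespace-separated field of a line."""
    return line.split()[0]


# Grouping in file order: one group per run of equal keys.
unsorted_keys = [k for k, _ in groupby(lines, key=first_col)]

# Sorting by the same key first makes equal keys adjacent.
sorted_keys = [k for k, _ in groupby(sorted(lines, key=first_col), key=first_col)]

print(unsorted_keys)  # ['1', '2', '1']
print(sorted_keys)    # ['1', '2']
```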
That is what itertools.groupby is for, though I think you need to read the whole file into memory to do it:
import itertools

with open("path/to/file") as f:
    data = f.readlines()  # a list of the lines of the file
groups = itertools.groupby(data, key=lambda line: line.split()[0])
# group on the first column of each line. This produces something like:
# [ ("1", ["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]),
#   ("2", ["2 hello goodbye seeya"]),
#   ("3", ["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]) ]
# since you only want the values there, just pull them out of the tuples;
# each group is a one-shot iterator, so copy it into a list right away
result = [list(v) for k, v in groups]
Honestly, though, I am not sure whether groupby consumes all of its input at once. If it is a lazy iterator, you can pass f to it directly:
import itertools

with open('path/to/file') as f:
    groups = itertools.groupby(f, key=lambda line: line.split()[0])
    for _, group in groups:
        result = list(group)
        # use this result however you like, but...
        # be sure not to leave the with block until you've consumed all of
        # the groups, or you won't be able to read any more of the file.
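The doubt about laziness can be checked with a small experiment. The sketch below uses a made-up generator, tracked_lines, as a stand-in for a file object and records each line as it is pulled. It also shows the caveat that a group goes stale once groupby is advanced past it:

```python
from itertools import groupby

consumed = []


def tracked_lines():
    """Stand-in for a file object that records each line as it is read."""
    for line in ["1 a\n", "1 b\n", "2 c\n", "3 d\n"]:
        consumed.append(line)
        yield line


groups = groupby(tracked_lines(), key=lambda line: line.split()[0])

key, group = next(groups)
first_group = list(group)
after_first = list(consumed)  # snapshot of what has been read so far

# Advance past the "2" group without reading it: it is then empty.
_, group2 = next(groups)
next(groups)
stale = list(group2)

print(first_group)            # ['1 a\n', '1 b\n']
print("3 d\n" in after_first)  # False: the last line had not been read yet
print(stale)                  # []
```

So passing the file handle straight to groupby keeps only the current group in memory, as long as each group is consumed before moving on to the next.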
If you cannot or do not want to read the whole file into memory at once, you have to do something special:
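def group_by_col(filename, key=None):
    if key is None:
        key = lambda s: s
    with open(filename) as f:
        cur_group = []
        grouper = object()  # sentinel that never equals a real key
        for line in f:  # iterate over the open handle f, not the builtin file
            new_grouper = key(line)
            if new_grouper != grouper:
                if cur_group:
                    yield cur_group
                cur_group = [line.rstrip()]  # strip every line, including the first of a group
                grouper = new_grouper
            else:
                cur_group.append(line.rstrip())
        if cur_group:  # nothing to yield for an empty file
            yield cur_group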
In this case you have to pass a key function that selects the first whitespace-separated column of each line, for example lambda s: s.split()[0]:
for group in group_by_col('path/to/file', key=lambda s: s.split()[0]):
    print(group)
This builds on the accepted answer and groups by any specified column:
def group_by_column(f, column):
    line = next(f)
    k = line.split()[column]
    group = [line]
    for line in f:
        if line.split()[column] == k:
            group.append(line)
        else:
            yield group
            group = [line]
            k = line.split()[column]
    yield group

if __name__ == "__main__":
    foo = "foo.txt"
    with open(foo) as foofile:
        for group in group_by_column(foofile, 0):
            print(group)
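No, you do not have to read the whole file into memory. groupby works lazily, and the file handle is a lazy iterator, so you only pay the memory overhead of one group at a time.

@touchmyboom juanpa says I do not need to do this and that I can pass f straight to groupby. I have not had time to test it, but I believe that is probably true! My edit basically just re-implements groupby, but leaves out the group header, which is the first element of each tuple in the result. Either way this should work (groupby is cleaner, of course).

If the values in the first column can appear anywhere, i.e. lines 0 and 100 both have 1 but the lines in between are all different, then you have to read the whole file into memory or make multiple passes, I suppose.

@juanpa.arrivillaga Right. The OP established in the question that his file is sorted correctly (grouped by the first column). Now that you have said it, I am 99% sure you are right, but since I have not had time to test it and the documentation does not say so explicitly, I will leave my re-implementation as it is. I will add a note in the first half to flag the doubt, though.

The documentation includes a sample implementation (not the actual implementation, of course), which implies that it is in fact lazy. It also warns: "The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible." In any case, the library was written by Raymond Hettinger and is designed to provide lazy, memory-efficient iterators.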
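For the unsorted case raised in the comments, where equal keys are not adjacent, one option is to collect the lines per key in a dictionary. This is a sketch under the assumption that the whole file fits in memory. The function name group_unsorted and the sample data are made up for illustration:

```python
from collections import defaultdict


def group_unsorted(lines):
    """Group lines by first column even when equal keys are not adjacent."""
    groups = defaultdict(list)
    for line in lines:
        groups[line.split()[0]].append(line.rstrip("\n"))
    return dict(groups)  # dicts preserve first-seen key order (Python 3.7+)


scattered = ["1 a\n", "2 b\n", "1 c\n"]
grouped = group_unsorted(scattered)
print(grouped)
```

Unlike groupby, this cannot yield a group until the whole input has been read, which is the trade-off the comment describes.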