How do I iterate over a file in chunks in Python?


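I have a file (foo.txt) that is sorted as follows (grouped by column 0):

How can I iterate over the file in chunks keyed on line.split()[0]? I know a generator can do this, but I'm not entirely sure how. Basically, I want to do something like this: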

def first_column_grouping(file):
    yield some_list ## How?

with open("foo.txt") as file:
    for group in first_column_grouping(file): ## 3 values
        print group

Expected output:

["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]
["2 hello goodbye seeya"]
["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]

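So what you actually want is the functionality provided by itertools.groupby. If your first column is sorted, this will work (s is assumed to be a string holding the contents of foo.txt):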

>>> import io
>>> from itertools import groupby
>>> from operator import itemgetter
>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(g))
...
['1  foo     bar\n', '1  lorem   ipsum   gypsum\n', '1  baba    loo     too\n']
['2  hello   goodbye seeya\n']
['3  kobe    magic   wilt\n', '3  foo     sneaks  bar\n', '3  more    stuff\n', '3  last    line    in      file']
>>>

If you want to clean up the output a little, you can map str.strip over each group:

>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(map(str.strip, g)))
...
['1  foo     bar', '1  lorem   ipsum   gypsum', '1  baba    loo     too']
['2  hello   goodbye seeya']
['3  kobe    magic   wilt', '3  foo     sneaks  bar', '3  more    stuff', '3  last    line    in      file']

If you want to implement this from scratch, an inflexible, naive generator might look like this:

>>> def groupby_first_column(f):
...     line = next(f)
...     k = line[0]
...     group = [line]
...     for line in f:
...         if line[0] == k:
...             group.append(line)
...         else:
...             yield group
...             group = [line]
...             k = line[0]
...     yield group
...
>>> with io.StringIO(s) as f:
...     for group in groupby_first_column(f):
...         print(list(group))
...
['1  foo     bar\n', '1  lorem   ipsum   gypsum\n', '1  baba    loo     too\n']
['2  hello   goodbye seeya\n']
['3  kobe    magic   wilt\n', '3  foo     sneaks  bar\n', '3  more    stuff\n', '3  last    line    in      file']
>>>

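Warning: the generator above only works when the first column of each line sits at the very first position and is exactly one character long. It isn't meant to be very useful, just to illustrate the idea. If you want to roll your own, you will have to be more thorough.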
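
To illustrate that limitation, here is a minimal sketch comparing a key on the first character with a key on the first whitespace-separated field. The sample lines and the helper name first_field are assumptions made up for this illustration, not part of the original answer:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data: the keys "1" and "10" share their first character.
lines = ["1 foo\n", "10 bar\n", "10 baz\n", "2 qux\n"]


def first_field(line):
    """Return the first whitespace-separated field of a line."""
    return line.split()[0]


# Keying on the first character lumps "1" and "10" into one group.
by_char = [list(g) for _, g in groupby(lines, itemgetter(0))]

# Keying on the first field keeps them apart.
by_field = [list(g) for _, g in groupby(lines, first_field)]

print(by_char)
print(by_field)
```

With a key function like first_field, the column can have any width and may be preceded by leading whitespace.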

Here is a variant (the dummy file here just stands in for the file in your with statement):

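This iterates over the file lazily and does not need to hold the whole file in memory. As a consequence, only adjacent lines are grouped.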
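
As a rough sketch of what "only adjacent lines are grouped" means in practice, the example below uses itertools.groupby on an in-memory stand-in for the file. The name dummy_file and its contents are assumptions for illustration only:

```python
import io
from itertools import groupby

# Stand-in for an open file; the contents are made up for illustration.
# Note that key "1" appears twice, separated by a line with key "2".
dummy_file = io.StringIO("1 a\n1 b\n2 c\n1 d\n")

groups = []
# groupby pulls lines from the file one at a time; the groups are collected
# into a list here only so that the result can be inspected afterwards.
for key, group in groupby(dummy_file, key=lambda line: line.split()[0]):
    groups.append((key, [line.rstrip("\n") for line in group]))

print(groups)
```

The two runs of key "1" come out as separate groups, which is why the input has to be sorted (or at least contiguous) on the grouping column.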


(You appear to be using Python 2: file is not a good variable name there, because it shadows a built-in.)

This is what itertools.groupby is for, though I think you have to read the whole file into memory to use it:

import itertools

with open("path/to/file") as f:
    data = f.readlines()  # a list of the lines of the file

groups = itertools.groupby(data, key=lambda line: line.split()[0])
# group on the first column of each line. This produces something like:
# [ ("1", ["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]),
#   ("2", ["2 hello goodbye seeya"]),
#   ("3", ["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]) ]

# since you only want the values there, just pull them out of the tuples;
# each group is an iterator, so materialize it with list() right away
result = [list(v) for k, v in groups]

But honestly, I'm not sure whether groupby consumes all of the data at once. If it is a lazy iterator, you can pass f directly:

import itertools

with open('path/to/file') as f:
    groups = itertools.groupby(f, key=lambda line: line.split()[0])
    for _, group in groups:
        result = list(group)
        # use this result however you like, but...
    # be sure not to leave this block until you've consumed all of
    # groups, or you won't be able to read any more of the file.

If you can't or don't want to read the whole file into memory at once, you have to do something special:

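def group_by_col(filename, key=None):
    if key is None:
        key = lambda s: s
    with open(filename) as f:
        cur_group = []
        grouper = None  # key of the group currently being collected
        for line in f:  # iterate over the open file object f
            new_grouper = key(line)
            if new_grouper != grouper:
                if cur_group:
                    yield cur_group
                cur_group = [line.rstrip()]  # strip every line consistently
                grouper = new_grouper
            else:
                cur_group.append(line.rstrip())
        if cur_group:  # don't yield an empty group for an empty file
            yield cur_group

In this case you have to pass a key function that selects the first whitespace-separated column of each line, e.g. lambda s: s.split()[0]: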

for group in group_by_col('path/to/file', key=lambda s: s.split()[0]):
    print(group)

This builds on the accepted answer and groups by any specified column:

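def group_by_column(f, column):
    line = next(f, None)
    if line is None:  # empty file: nothing to group
        return
    k = line.split()[column]
    group = [line]
    for line in f:
        if line.split()[column] == k:
            group.append(line)
        else:
            # key changed: emit the finished group and start a new one
            yield group
            group = [line]
            k = line.split()[column]
    yield group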


if __name__ == "__main__":

    foo = "foo.txt"
    with open(foo) as foofile:
        for group in group_by_column(foofile, 0):
            print(group)

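No, you don't have to read the whole file into memory. groupby works lazily, and a file handle is a lazy iterator, so you only pay the memory overhead of each group.

@touchmyboom - juanpa says I don't need to do that: I can pass f directly to groupby. I haven't had time to test it, but I believe that's probably true! My edit basically just reimplements groupby, but leaves out the group key, which is the first element of each tuple in the result. Either way (groupby is cleaner, of course), this should work.

If the values in the first column can appear anywhere, i.e. line 0 and line 100 both have 1 but the lines in between are all different, then you would have to read the whole file into memory or make multiple passes, I suppose.

@juanpa.arrivillaga Correct. The OP established in the question that his file is properly sorted (grouped by the first column). Now that you've said it, I'm 99% sure you're right, but since I don't have time to test it and the documentation doesn't say so explicitly, I'll leave my reimplementation as it is. I will add a note to the top half to flag the doubt, though.

The documentation includes a sample implementation (not the actual implementation, of course), which implies that it really is lazy. It also warns: "The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible." In any case, the library was written by Raymond Hettinger, and it is meant to provide lazy, memory-efficient iterators.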
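
The warning quoted from the documentation can be checked with a small sketch. The sample lines below are made up for illustration; the point is only that a group has to be consumed (or copied with list()) before groupby is advanced to the next one:

```python
from itertools import groupby

# Hypothetical sample lines, already grouped on the first column.
lines = ["1 a", "1 b", "2 c"]
groups = groupby(lines, key=lambda line: line.split()[0])

key1, group1 = next(groups)   # first group, not consumed yet
key2, group2 = next(groups)   # advancing groupby skips past group1's lines

stale = list(group1)          # the earlier group is no longer visible
current = list(group2)        # the current group is still readable

print(key1, stale)
print(key2, current)
```

This is also why a plain `[v for k, v in groups]` ends up holding exhausted iterators, while `[list(v) for k, v in groups]` keeps the lines.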