Python: How do I iterate over a file in chunks?


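I have a file (foo.txt) that is sorted so that its lines are grouped by column 0.

How can I iterate over the file in chunks of line.split()[0]? I know a generator can do this, but I'm not entirely sure how. Basically, I want to do this: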

def first_column_grouping(file):
    yield some_list ## How?

with open("foo.txt") as file:
    for group in first_column_grouping(file): ## 3 values
        print group

Expected output:

["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]
["2 hello goodbye seeya"]
["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]

So what you actually want is the functionality provided by itertools.groupby. This will work as long as your first column is sorted:

>>> import io
>>> from itertools import groupby
>>> from operator import itemgetter
>>> # s is assumed to hold the contents of foo.txt as a string
>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(g))
...
['1  foo     bar\n', '1  lorem   ipsum   gypsum\n', '1  baba    loo     too\n']
['2  hello   goodbye seeya\n']
['3  kobe    magic   wilt\n', '3  foo     sneaks  bar\n', '3  more    stuff\n', '3  last    line    in      file']
>>>
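To run the session above on its own, s has to hold the text of foo.txt. Here is a minimal self-contained sketch, assuming sample data reconstructed from the output shown above; the names SAMPLE and first_char_groups are hypothetical and only serve the illustration:

```python
import io
from itertools import groupby
from operator import itemgetter

# Assumed sample data, reconstructed from the output shown above.
SAMPLE = (
    "1  foo     bar\n"
    "1  lorem   ipsum   gypsum\n"
    "1  baba    loo     too\n"
    "2  hello   goodbye seeya\n"
    "3  kobe    magic   wilt\n"
    "3  foo     sneaks  bar\n"
    "3  more    stuff\n"
    "3  last    line    in      file"
)


def first_char_groups(text):
    """Group consecutive lines that share the same first character."""
    with io.StringIO(text) as f:
        # itemgetter(0) applied to a line returns its first character.
        # Each group is copied into a list before groupby advances.
        return [list(g) for _, g in groupby(f, itemgetter(0))]


groups = first_char_groups(SAMPLE)
for group in groups:
    print(group)
```

Because the key is only the first character, this relies on single-character keys at the very start of each line.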

If you want to clean up the output a bit, you can map str.strip over your groups:

>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(map(str.strip, g)))
...
['1  foo     bar', '1  lorem   ipsum   gypsum', '1  baba    loo     too']
['2  hello   goodbye seeya']
['3  kobe    magic   wilt', '3  foo     sneaks  bar', '3  more    stuff', '3  last    line    in      file']

If you wanted to implement this from scratch, an inflexible, naive generator might look like this:

>>> def groupby_first_column(f):
...     line = next(f)
...     k = line[0]
...     group = [line]
...     for line in f:
...         if line[0] == k:
...             group.append(line)
...         else:
...             yield group
...             group = [line]
...             k = line[0]
...     yield group
...
>>> with io.StringIO(s) as f:
...     for group in groupby_first_column(f):
...         print(list(group))
...
['1  foo     bar\n', '1  lorem   ipsum   gypsum\n', '1  baba    loo     too\n']
['2  hello   goodbye seeya\n']
['3  kobe    magic   wilt\n', '3  foo     sneaks  bar\n', '3  more    stuff\n', '3  last    line    in      file']
>>>

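Warning: the generator above only works if the first column of each line sits exactly in the first position and is exactly one character long. It isn't meant to be very useful, only to illustrate the idea. If you want to roll your own, you will have to be more thorough.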
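To illustrate that limitation: once keys are longer than one character, grouping on the first character merges distinct keys, while grouping on the first whitespace-separated field keeps them apart. A small sketch with made-up data (the lines and the names by_first_char and by_key are assumptions for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Made-up lines whose first column is two characters wide.
lines = ["10 a\n", "10 b\n", "11 c\n", "12 d\n"]

# First character only: "10", "11" and "12" all start with "1",
# so every line lands in the same group.
by_first_char = [list(g) for _, g in groupby(lines, itemgetter(0))]

# First whitespace-separated field: each key gets its own group.
by_key = [list(g) for _, g in groupby(lines, key=lambda line: line.split()[0])]

print(len(by_first_char))
print(len(by_key))
```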

Here is a variant (where the file is simply the one from your with statement). It iterates over the file without needing to hold the whole file in memory; as a result, only adjacent lines are grouped.

(You seem to be using Python 2: file is not a good variable name, because it is a builtin.)
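A minimal sketch of a variant with those properties, assuming itertools.groupby keyed on the first whitespace-separated column. This is an illustration and not necessarily the code the answer's author had in mind; io.StringIO stands in for the open file, and the name dummy_file and its contents are made up:

```python
import io
from itertools import groupby


def first_column_grouping(lines):
    """Yield lists of adjacent lines that share the same first column.

    Assumes every line has at least one field (no blank lines).
    """
    for _, group in groupby(lines, key=lambda line: line.split()[0]):
        # Materialize each group before groupby advances to the next one.
        yield [line.rstrip("\n") for line in group]


# Stand-in for open("foo.txt"); the contents are made up for illustration.
dummy_file = io.StringIO("1 foo bar\n1 lorem ipsum\n2 hello goodbye\n3 kobe magic\n")
result = list(first_column_grouping(dummy_file))
for group in result:
    print(group)
```

Only one group is held in memory at a time while iterating, which is why equal keys that are not adjacent end up in separate groups.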
This is what itertools.groupby does, though I think you would need to read the whole file into memory to do it:
import itertools

with open("path/to/file") as f:
    data = f.readlines()  # a list of the lines of the file

groups = itertools.groupby(data, key=lambda line: line.split()[0])
# group on the first column of each line. This produces something like:
# [ ("1", ["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]),
#   ("2", ["2 hello goodbye seeya"]),
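#   ("3", ["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]) ]

# since you only want the values there, pull them out of the tuples;
# copy each group into a list, because groupby's group iterators are
# invalidated as soon as groupby advances to the next key
result = [list(v) for k, v in groups]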

Honestly, though, I'm not sure whether groupby consumes all of the data at once. If it is a lazy iterator, you can pass f directly:

import itertools

with open('path/to/file') as f:
    groups = itertools.groupby(f, key=lambda line: line.split()[0])
    for _, group in groups:
        result = list(group)
        # use this result however you like, but...
    # be sure not to leave this block until you've consumed all of
    # result, or you won't be able to read any more of the file.
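The reason each group is turned into a list inside the loop is that the group iterators share the underlying source with groupby, so a group loses its contents once groupby moves on. A small sketch of the difference, using made-up in-memory lines in place of a file (the names stale and fresh are illustrative):

```python
from itertools import groupby

lines = ["1 a\n", "1 b\n", "2 c\n"]


def first_col(line):
    """Return the first whitespace-separated field of a line."""
    return line.split()[0]


# Keeping the raw group iterators: by the time they are read, groupby has
# already advanced past them, so the grouped lines are no longer available.
stale = [g for _, g in groupby(lines, key=first_col)]
stale_contents = [list(g) for g in stale]

# Copying each group into a list while it is current keeps the data.
fresh = [list(g) for _, g in groupby(lines, key=first_col)]

print(stale_contents)
print(fresh)
```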


If you can't or don't want to read the whole file into memory at once, you have to do something special:
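def group_by_col(filename, key=None):
    if key is None:
        key = lambda s: s
    with open(filename) as f:
        cur_group = []
        grouper = object()  # sentinel that never equals a real key
        for line in f:  # iterate over the opened handle, not the builtin `file`
            new_grouper = key(line)
            if new_grouper != grouper:
                # key changed: emit the finished group and start a new one
                if cur_group:
                    yield cur_group
                cur_group = [line.rstrip()]
                grouper = new_grouper
            else:
                cur_group.append(line.rstrip())
        if cur_group:  # emit the last group, unless the file was empty
            yield cur_group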

In this case you have to pass a key function that picks out the first whitespace-separated column of each line, for example lambda s: s.split()[0]:

for group in group_by_col('path/to/file', key=lambda s: s.split()[0]):
    print(group)


This builds on the accepted answer and groups by any specified column:
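def group_by_column(f, column):
    # Read the first line; stop quietly if the file is empty.
    line = next(f, None)
    if line is None:
        return
    k = line.split()[column]
    group = [line]
    for line in f:
        # Compare each line's key with the key of the current group.
        if line.split()[column] == k:
            group.append(line)
        else:
            yield group
            group = [line]
            k = line.split()[column]
    yield group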


if __name__ == "__main__":

    foo = "foo.txt"
    with open(foo) as foofile:
        for group in group_by_column(foofile, 0):
            print(group)
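The same column parameter can also be fed to itertools.groupby through a key function, which avoids hand-written bookkeeping. A minimal sketch, assuming made-up data and a hypothetical helper name group_by_column_lazy; io.StringIO stands in for the open file:

```python
import io
from itertools import groupby


def group_by_column_lazy(lines, column):
    """Yield lists of adjacent lines whose given column has the same value."""

    def key(line):
        return line.split()[column]

    for _, group in groupby(lines, key=key):
        # Copy the group before groupby advances to the next key.
        yield list(group)


# Made-up data, grouped here on the second column (index 1).
sample = io.StringIO("a x 1\nb x 2\nc y 3\n")
by_second = list(group_by_column_lazy(sample, 1))
for group in by_second:
    print(group)
```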


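No, you don't have to read the whole file into memory. groupby works lazily, and the file handle is a lazy iterator, so you only pay the memory overhead of one group at a time.

@touchmyboom juanpa says I don't need to do that and that I can pass f directly to groupby. I haven't had time to test it, but I believe that's probably true! My edit basically just reimplements groupby, but leaves out the grouping key, which is the first element of each tuple in the result. Either way should work (though groupby is cleaner, of course).

If the values in the first column can appear anywhere, i.e. line 0 and line 100 both have 1 but the lines in between are all different, then you would have to read the whole file into memory or make multiple passes, I suppose.

@juanpa.arrivillaga Correct. The OP established in the question that his file is properly sorted (grouped by the first column). Now that you've said it, I'm 99% sure you're right, but since I don't have time to test it and the docs don't say so explicitly, I'll leave my reimplementation as it is. I will add a note to the top half acknowledging the doubt, though.

The docs include a sample implementation (not the actual one, of course) which implies that it is indeed lazy. They also warn: "The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible." In any case, the library was written by Raymond Hettinger, and it is designed to provide lazy, memory-efficient iterators.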
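For the non-adjacent case raised in the comments, where the same key shows up in separate parts of the file, one common approach is to collect the lines into a dictionary in a single pass, at the cost of holding every line in memory. A minimal sketch, assuming made-up data and a hypothetical helper name group_all_by_first_column:

```python
from collections import defaultdict


def group_all_by_first_column(lines):
    """Group lines by first column even when equal keys are not adjacent."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        groups[fields[0]].append(line.rstrip("\n"))
    return dict(groups)


# Made-up data in which key "1" appears in two separate places.
scattered = ["1 a\n", "2 b\n", "1 c\n"]
grouped = group_all_by_first_column(scattered)
print(grouped)
```

Unlike groupby, this does not depend on the input being sorted or grouped, but it cannot yield any group until the whole input has been read.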