How can I read a large file block by block in Python and branch on each block's header?


I have a large file that I want to read block by block by matching headers. For example, the file looks like this:

@header1
a b c 1 2 3
c d e 2 3 4
q w e 3 4 5
@header2
e 89 78 56
s 68 77 26
...

I wrote a script like this:

with open("filename") as f:
  line=f.readline()
  if line.split()[0]=="@header1":
     list1.append(f.readline().split()[0])
     list2.append(f.readline().split()[1])
     ...
  elif line.split()[0]=="@header2":
     list6.append(f.readline().split()[0])
     list7.append(f.readline().split()[1])
     ...

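But it seems to read only the first header and never reaches the second block. There are also some blank lines between the blocks. How can I read a block whenever a line matches a given string, and skip those blank lines?

I know that in C this would be a switch. How can I do something similar in Python?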

I don't know exactly what you want to achieve, but perhaps something like this:

with open(filename) as f:
    for line in f:
        if line.startswith('@'):
            print('header')
            # do something with header here
        else:
            print('regular line')
            # do something with the line here

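It seems to me that your misunderstanding is about how to read a CSV file. At the very least, I doubt that C's "switch" would help you any more here than if clauses do.

However, please understand that you have to go through your file line by line. That is, if you do not know a block's length beforehand, nothing can process the whole block in one step.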

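So your algorithm looks like this:

for each line in the file:
. . is it a header?
. . . then prepare for this particular header
. . is it a blank line?
. . . then skip it
. . is it data?
. . . then add it according to the preparation above

In code, this could be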

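block_data = []                          # one list of rows per block
with open(filename) as f:
    for line in f:
        line = line.strip()              # drop the newline; blank lines become ''
        if not line:                     # skip blank lines
            continue
        if line.startswith('@header'):
            block_data.append([])        # start a new block
        elif block_data:                 # ignore data that precedes the first header
            block_data[-1].append(line.split())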
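
If the blocks need to be told apart by header name rather than by position, the same loop can keep the rows in a dict keyed by the header text. The sketch below assumes every header line starts with `@` and that data rows are whitespace-separated; the function name `read_blocks` and the sample text are made up for illustration.

```python
import io
from collections import defaultdict


def read_blocks(stream):
    """Group whitespace-split rows under the header line that precedes them."""
    blocks = defaultdict(list)
    current = None                      # header of the block being read
    for line in stream:
        line = line.strip()
        if not line:                    # skip blank lines between blocks
            continue
        if line.startswith('@'):        # assumption: headers start with '@'
            current = line.split()[0]
            blocks[current]             # register the header even if it has no rows
        elif current is not None:       # rows before the first header are ignored
            blocks[current].append(line.split())
    return dict(blocks)


sample = io.StringIO(
    "@header1\n"
    "a b c 1 2 3\n"
    "c d e 2 3 4\n"
    "\n"
    "@header2\n"
    "e 89 78 56\n"
)
blocks = read_blocks(sample)

# Per-header processing is then a plain lookup, which plays the role of a C switch.
list1 = [row[0] for row in blocks["@header1"]]
list6 = [row[0] for row in blocks["@header2"]]
print(list1, list6)
```

Looking a header up in a dict is probably the closest everyday Python idiom to a switch; an if/elif chain works just as well when there are only two or three headers.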

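Below is a solution that uses a Python generator, split_into_chunks(f), to extract each section (as a list of strings), squelch blank lines, and detect a missing @header and EOF. The generator approach is neat because it lets you wrap it further, e.g. in a CSV reader object that handles space-separated values (such as pandas read_csv).

The code is below. I have also parameterized the value demarcator='@header' for you. Note that we have to iterate with line = inputstream.readline() instead of the usual for line in f, because when we see the next section's @header we need to push it back with seek/tell(). In Python 3, calling tell() on a text file while a for loop is iterating over it raises OSError, because the iterator's read-ahead disables it. If you want to modify the generator to yield the chunk header and the chunk body separately (e.g. as a list of two items), that is trivial.
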
def split_into_chunks(inputstream, demarcator='@header'):
    """Utility generator to get sections from file, demarcated by '@header'"""

    while True:
        chunk = []

        line = inputstream.readline()
        # At EOF?
        if not line: break
        # Expect that each chunk starts with one header line
        if not line.startswith(demarcator):
            raise RuntimeError(f"Bad chunk, missing {demarcator}")

        chunk.append(line.rstrip('\n'))

        # Can't use `for line in inputstream:` since we may need to pushback
        while line:
            # Remember our file-pointer position in case we need to pushback a header row
            last_pos = inputstream.tell()
            line = inputstream.readline()

            # Saw next chunk's header line? Pushback the header line, then yield the current chunk
            if line.startswith(demarcator):
                inputstream.seek(last_pos)
                break

            # Ignore blank or whitespace-only lines (a bare '\n' is truthy, so test the stripped line)
            if line.strip():
                chunk.append(line.rstrip('\n'))

        yield chunk


with open('your.ssv') as f:
    for chunk in split_into_chunks(f):
        # Do stuff on chunk. Presumably, wrap it with a reader which handles space-separated values, e.g. pandas read_csv
        print(chunk)
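
When the input cannot be rewound (a pipe, say), the same chunking can be done without seek/tell by holding the current chunk and yielding it once the next header arrives. This is a different technique from the generator above, shown only as a sketch: `iter_chunks` is a hypothetical name, and the standard-library `csv.reader` stands in for whatever reader you wrap around each chunk.

```python
import csv
import io


def iter_chunks(lines, demarcator='@header'):
    """Yield (header, body_lines) pairs from any iterable of lines."""
    header, body = None, []
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():            # drop blank or whitespace-only lines
            continue
        if line.startswith(demarcator):
            if header is not None:      # a new header closes the previous chunk
                yield header, body
            header, body = line, []
        elif header is None:
            raise RuntimeError(f"Bad chunk, missing {demarcator}")
        else:
            body.append(line)
    if header is not None:              # flush the last chunk at EOF
        yield header, body


sample = io.StringIO("@header1\na b c 1 2 3\n\n@header2\ne 89 78 56\ns 68 77 26\n")
parsed = {
    header: list(csv.reader(body, delimiter=' '))
    for header, body in iter_chunks(sample)
}
print(parsed)
```

Because it only needs an iterable of lines, this variant also works with `for line in f` style input, at the cost of keeping one chunk in memory at a time (which the seek/tell version does as well).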

I saw another post similar to this question and have copied the idea here. I agree that SpightTCD is correct, although I have not tried it:
    with open(filename) as f:
        #find each line number that contains header
        for i,line in enumerate(f,1):
            if 'some_header' in line:
                index1=i
            elif 'another_header' in line:
                index2=i
            ...
    with open(filename) as f:
        #read the first block:
        for i in range(int(index1)):
            line=f.readline()
        for i in range('the block size'):
            'read, split and store'
        f.seek(0)
        #read the second block, third and ... 
        ...
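
The two-pass idea above can be made concrete roughly as follows. The helper names `index_headers` and `read_block`, the assumption that headers start with `@`, and the throwaway demo file are all illustrative and not taken from the original post.

```python
import itertools
import os
import tempfile


def index_headers(path, marker='@'):
    """First pass: return [(line_number, header_text), ...] with 1-based line numbers."""
    with open(path) as f:
        return [(i, line.strip()) for i, line in enumerate(f, 1)
                if line.startswith(marker)]


def read_block(path, start, stop=None):
    """Second pass: return the split, non-blank rows on lines start+1 .. stop-1."""
    with open(path) as f:
        # islice is 0-based, so index `start` is the line right after the header
        rows = itertools.islice(f, start, None if stop is None else stop - 1)
        return [line.split() for line in rows if line.strip()]


# Demo on a throwaway file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("@header1\na b c 1 2 3\n\n@header2\ne 89 78 56\ns 68 77 26\n")

index = index_headers(tmp.name)
# Pair each header with the line number of the next one (None for the last block)
bounds = zip(index, index[1:] + [(None, None)])
blocks = {name: read_block(tmp.name, start, stop)
          for (start, name), (stop, _) in bounds}
os.remove(tmp.name)
print(blocks)
```

This reads the file once per block plus once for the index, so for a very large file a single pass is likely to be faster; the two-pass form mainly helps when only a few blocks are needed.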

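You need to add more detail. Are these multiple space-separated sections all in one file? Are the @header... lines guaranteed to be numbered consecutively and in order? If @header1 appears entirely on its own line, why test line.split()[0] == "@header2" rather than simply line.strip() == "@header2"? Or just line.startswith("@header"), which should catch everything without even needing a regex? Ultimately I expect you want to read the space-separated contents of each line (in each section, according to its header), so you will need to wrap a reader object around it. Or write a generator that yields each chunk of lines separately, so that you can pass it to a reader object. "There are also some blank lines between the blocks." So can you guarantee that blank lines occur only outside sections and never inside them? This lends itself to a generator approach; see my answer.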