Splitting a text file into sections using special delimiter lines (Python)


I have an input file like this:

This is a text block start
This is the end

And this is another
with more than one line
and another line.

The task is to read the file in sections separated by some special line, in this case an empty line, e.g. [out]:

[['This is a text block start', 'This is the end'],
['And this is another','with more than one line', 'and another line.']]

I got the desired output by doing this:

def per_section(it):
    """ Read a file and yield sections using empty line as delimiter """
    section = []
    for line in it:
        if line.strip('\n'):
            section.append(line)
        else:
            yield ''.join(section)
            section = []
    # yield any remaining lines as a section too
    if section:
        yield ''.join(section)

But what if the special line is one that starts with "#", for example:

# Some comments, maybe the title of the following section
This is a text block start
This is the end
# Some other comments and also the title
And this is another
with more than one line
and another line.

Then I have to do this:

def per_section(it):
    """ Read a file and yield sections using lines starting with '#' as delimiter """
    section = []
    for line in it:
        if line[0] != "#":
            section.append(line)
        else:
            yield ''.join(section)
            section = []
    # yield any remaining lines as a section too
    if section:
        yield ''.join(section)

If I let per_section() take a delimiter parameter, I could try this:

def per_section(it, delimiter='\n'):
    """ Read a file and yield sections; delimiter is a blank line or a '#' line """
    section = []
    for line in it:
        # blank-line mode: keep every non-empty line
        if delimiter == '\n' and line.strip('\n'):
            section.append(line)
        # '#' mode: keep every line that does not start with '#'
        elif delimiter == '#' and line[0] != "#":
            section.append(line)
        else:
            yield ''.join(section)
            section = []
    # yield any remaining lines as a section too
    if section:
        yield ''.join(section)

But is there a way to do this without hard-coding every possible delimiter?

How about passing a predicate?

def per_section(it, is_delimiter=lambda x: x.isspace()):
    ret = []
    for line in it:
        if is_delimiter(line):
            if ret:
                yield ret  # OR  ''.join(ret)
                ret = []
        else:
            ret.append(line.rstrip())  # OR  ret.append(line)
    if ret:
        yield ret

Usage:

with open('/path/to/file.txt') as f:
    sections = list(per_section(f))  # default delimiter

with open('/path/to/file.txt.txt') as f:
    sections = list(per_section(f, lambda line: line.startswith('#'))) # comment
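
To try the predicate version without a file on disk, here is a minimal sketch that uses io.StringIO as a stand-in for the open file. The function is repeated so the snippet runs on its own; the sample text is the one from the question, and the variable names are only illustrative.

```python
import io

def per_section(it, is_delimiter=lambda x: x.isspace()):
    """Yield lists of lines, split wherever is_delimiter(line) is true."""
    ret = []
    for line in it:
        if is_delimiter(line):
            if ret:  # skip empty sections (e.g. consecutive delimiters)
                yield ret
                ret = []
        else:
            ret.append(line.rstrip())
    if ret:
        yield ret

# io.StringIO behaves like an open text file: iterating it yields lines.
blank_separated = io.StringIO(
    "This is a text block start\n"
    "This is the end\n"
    "\n"
    "And this is another\n"
    "with more than one line\n"
    "and another line.\n"
)
hash_separated = io.StringIO(
    "# Some comments, maybe the title of the following section\n"
    "This is a text block start\n"
    "This is the end\n"
    "# Some other comments and also the title\n"
    "And this is another\n"
    "with more than one line\n"
    "and another line.\n"
)

by_blank = list(per_section(blank_separated))  # default: blank lines
by_hash = list(per_section(hash_separated, lambda line: line.startswith('#')))
print(by_blank)
print(by_hash)
```

Both calls should give the same two sections, since the predicate is the only thing that changes between them.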

How about something like this:

from itertools import groupby

def per_section(s, delimiters=()):
    def key(s):
        return not s or s.isspace() or any(s.startswith(x) for x in delimiters)
    for k, g in groupby(s.splitlines(), key=key):
        if not k:
            yield list(g)


if __name__ == '__main__':
    # blank lines are always treated as delimiters
    print(list(per_section('''This is a text block start
This is the end

And this is another
with more than one line
and another line.''')))

    # ('#',) is a one-element tuple; ('#') would just be the string '#'
    print(list(per_section('''# Some comments, maybe the title of the following section
This is a text block start
This is the end
# Some other comments and also the title
And this is another
with more than one line
and another line.''', ('#',))))

    # several delimiter prefixes at once
    print(list(per_section('''!! Some comments, maybe the title of the following section
This is a text block start
This is the end
$$ Some other comments and also the title
And this is another
with more than one line
and another line.''', ('!', '$'))))

Output:

[['This is a text block start', 'This is the end'], ['And this is another', 'with more than one line', 'and another line.']]
[['This is a text block start', 'This is the end'], ['And this is another', 'with more than one line', 'and another line.']]
[['This is a text block start', 'This is the end'], ['And this is another', 'with more than one line', 'and another line.']]
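
The groupby version above takes the whole text as one string. A rough sketch of the same idea applied to any iterable of lines (an open file, for instance), with a predicate instead of a tuple of prefixes, might look like this. The names per_section_grouped and is_delimiter are made up for illustration and are not part of the answer above.

```python
from itertools import groupby

def per_section_grouped(lines, is_delimiter=lambda line: not line.strip()):
    """Group consecutive non-delimiter lines into sections."""
    # groupby batches consecutive lines that share the same key value,
    # so runs of delimiter lines and runs of content lines alternate.
    for delimiter_run, group in groupby(lines, key=is_delimiter):
        if not delimiter_run:
            yield [line.rstrip('\n') for line in group]

# A list of lines stands in for an open file here.
sample = [
    "!! Some comments\n",
    "This is a text block start\n",
    "This is the end\n",
    "$$ Some other comments\n",
    "And this is another\n",
    "with more than one line\n",
    "and another line.\n",
]
# str.startswith accepts a tuple of prefixes
sections = list(per_section_grouped(sample, lambda line: line.startswith(('!', '$'))))
print(sections)
```

Because groupby consumes its input lazily, only one section needs to be held in memory at a time when the input is a file object.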

Just do this:

with open('yourfilename.txt') as f:  # open the desired file
    data = f.read()  # read the whole file into the variable data
    print(*(data.split('==========')))  # split data at each "==========" and print the parts
    # split() returns a list; the * unpacks it so the parts are printed as plain strings

Output:

content content
more content
content conclusion

content again
more of it
content conclusion

content
content
contend done

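Why not just pass it as a parameter instead of hard-coding it?

By the way, @falsetru's per_section() has been added to =)

It's usually best to add an explanation of what the code does. That helps new developers understand how the code works.

You're right, so I've explained everything in the code.

Don't do this if the file might be large.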
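
On the last comment: a minimal sketch that avoids reading the whole file at once, assuming the "==========" separator sits on a line of its own (the answer above does not say so). The name split_on_separator is hypothetical, and io.StringIO stands in for the open file.

```python
import io

def split_on_separator(lines, separator='=========='):
    """Yield each section as one string, reading line by line."""
    section = []
    for line in lines:
        if line.strip() == separator:
            # separator line reached: emit what has been collected so far
            yield ''.join(section)
            section = []
        else:
            section.append(line)
    if section:  # emit the trailing section, if any
        yield ''.join(section)

# io.StringIO stands in for an open file here.
demo = io.StringIO(
    "content content\n"
    "more content\n"
    "==========\n"
    "content again\n"
    "more of it\n"
)
parts = list(split_on_separator(demo))
print(parts)
```

With a real file, iterating over split_on_separator(f) directly instead of wrapping it in list() keeps only the current section in memory.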