Python: splitting lines into paragraphs

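Input: a list of lines.

Output: a list of lists of lines, which is the input list split at (one or more) empty lines.

This is the least ugly solution I have so far: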

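def split_at_empty(lines):
    paragraphs = []
    p = []
    def flush():
        # Without nonlocal, the assignment below makes p local to flush(),
        # and "if p" raises UnboundLocalError (Python 3 only).
        nonlocal p
        if p:
            paragraphs.append(p)
        p = []
    for l in lines:
        if l:
            p.append(l)
        else:
            flush()
    flush()
    return paragraphs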

There must be a better solution (maybe even a functional one)! Anyone?

Example input list:

['','2','3','','5','6','7','8','','','11']

Output:

[['2','3'],['5','6','7','8'],['11']]

You can join the list into a string and then split it again:

>>> a = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
>>> [x.strip().split(' ') for x in ' '.join(a).split('  ')]
[['2', '3'], ['5', '6', '7', '8'], ['11']]

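You should probably use a regular expression instead, so that any number of consecutive spaces is caught (in my test I added another one before '11').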
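The regex version itself is not reproduced here. A minimal sketch of what it could look like, assuming (as the join-based answer does) that no line contains a space; the name `split_at_empty_regex` is made up for this illustration:

```python
import re

def split_at_empty_regex(lines):
    # Joining with one space turns every run of empty lines into a run of
    # two or more spaces; a leading or trailing empty line leaves a stray
    # space at the edge, which strip() removes.
    joined = ' '.join(lines).strip()
    # Split on runs of 2+ spaces, then split each chunk back into lines.
    return [chunk.split(' ') for chunk in re.split(r' {2,}', joined) if chunk]

a = ['', '2', '3', '', '5', '6', '7', '8', '', '', '', '11']
print(split_at_empty_regex(a))
```

Like the plain join/split version, this breaks as soon as a line contains a space, so it only suits token-like lines such as the ones in the example.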

Here is a generator-based solution:

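def split_at_empty(lines):
    # Boundaries: a virtual empty line before the start (-1), every real
    # empty line, and a virtual empty line after the end. Starting at -1
    # rather than 0 keeps a non-empty first line.
    sep = [-1] + [i for (i, l) in enumerate(lines) if not l] + [len(lines)]
    for start, end in zip(sep[:-1], sep[1:]):
        if start + 1 < end:  # skip adjacent empty lines
            yield lines[start + 1:end]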

It yields:

['2', '3']
['5', '6', '7', '8']
['11']

Another way is to work directly on the list:

li = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
lo = ['5055', '', '', '2', '54', '87', '', '1', '2', '5', '8', '', '']
lu = ['AAAAA', 'BB', '', 'HU', 'JU', 'GU']

def selines(L):
    ye = []  # paragraph currently being accumulated
    for x in L:
        if x:
            ye.append(x)
        elif ye:
            # an empty line closes the current paragraph
            yield ye
            ye = []
    if ye:
        yield ye

for lx in (li, lo, lu):
    print(lx)
    print(list(selines(lx)))
    print()

Result:

['Princess Maria Amelia of Brazil (1831\x961853)']

['was the daughter of Dom Pedro I,', "founder of Brazil's independence and its first emperor,"]

['and Amelie of Leuchtenberg.']

["The only child from her father's second marriage,", 'Maria Amelia was born in France', "following Pedro I's 1831 abdication in favor of his son Dom Pedro II."]

['Before Maria Amelia was a month old, Pedro I left for Portugal', 'to restore its crown to his eldest daughter Dona Maria II.', "He defeated his brother Miguel I (who had usurped Maria II's throne),", 'only to die a few months later of tuberculosis.']

[['2', '3'], ['5', '6', '7', '8'], ['11']]

['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
[['2', '3'], ['5', '6', '7', '8'], ['11']]

['5055', '', '', '2', '54', '87', '', '1', '2', '5', '8', '', '']
[['5055'], ['2', '54', '87'], ['1', '2', '5', '8']]

['AAAAA', 'BB', '', 'HU', 'JU', 'GU']
[['AAAAA', 'BB'], ['HU', 'JU', 'GU']]
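Because this approach only iterates over its input once, the same idea also works on a file object. A small sketch under two assumptions that are not part of the answer: lines arrive with their trailing newline (as they do when read from a file), and whitespace-only lines count as empty. The name `paragraphs` is hypothetical.

```python
import io

def paragraphs(stream):
    """Yield lists of non-blank lines from any iterable of strings."""
    block = []
    for raw in stream:
        line = raw.rstrip('\r\n')  # lines read from a file keep their newline
        if line.strip():           # assumption: whitespace-only means empty
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block

text = "first line\nsecond line\n\n\nthird line\n"
print(list(paragraphs(io.StringIO(text))))
# [['first line', 'second line'], ['third line']]
```

Only one paragraph is held in memory at a time, which matters when the input is large.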

Slightly less ugly than the original:

def split_at_empty(lines):
    r = [[]]
    for l in lines:
        if l:
            r[-1].append(l)
        else:
            r.append([])
    return [l for l in r if l]

(The last line removes the empty lists that would otherwise be added.)

For the list-comprehension obsessed:

def split_at_empty(L):
    # First list: indices where a paragraph starts; second: where one ends.
    return [L[start:end+1] for start, end in zip(
        [n for n in range(len(L)) if L[n] and (n == 0 or not L[n-1])],
        [n for n in range(len(L)) if L[n] and (n+1 == len(L) or not L[n+1])]
        )]

Or better:

def split_at_empty(lines):
    L = [i for i, a in enumerate(lines) if not a]
    return [lines[s + 1:e] for s, e in zip([-1] + L, L + [len(lines)]) 
            if e > s + 1]
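Since the question asks for something functional, here is one more variant for comparison. It is not one of the answers above, only a sketch using `itertools.groupby` from the standard library; the name `split_at_empty_groupby` is made up for this illustration.

```python
from itertools import groupby

def split_at_empty_groupby(lines):
    # groupby(lines, key=bool) yields consecutive runs of lines sharing the
    # same truthiness; keep only the runs of non-empty lines.
    return [list(group) for nonempty, group in groupby(lines, key=bool) if nonempty]

print(split_at_empty_groupby(['', '2', '3', '', '5', '6', '7', '8', '', '', '11']))
# [['2', '3'], ['5', '6', '7', '8'], ['11']]
```

It accepts any iterable, not just a list, and needs no index arithmetic. Swapping the list comprehension for a generator expression would make it lazy.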

Please post a sample input list.

@Jo So your "solution" as posted doesn't work: the local p in flush() is responsible for UnboundLocalError: local variable 'p' referenced before assignment. That's not serious, though.

@eyquem My bad. Too much JavaScript lately. To make it work, we have to make it a little uglier.

@Jo Well, it was nice of you to post your sample input list.

I thought of this, but don't you think it's a bit complicated, with too much overhead and long lines, when the only benefit is fewer lines?

Maybe this is the best one! Using a generator makes it cleaner, thanks. For now, this is the accepted answer. Please edit it to be short and clear, and you'll get the rep :)

@Jo Hello, I thought carefully about your comment. I corrected the code above two days ago. Today I also corrected my other answers, because you are right: I tend to write answers that are too long. I even deleted the comments I had written here, since I have no interest in cluttering Stack Overflow's memory with useless comments. Thank you for pointing out my shortcoming; I'll bear it in mind.

Not bad, really! The overhead is OK. I rather like the simplicity of this one. The only problem is if the input list is huge.

Unfortunately, the first one is wrong. The second is fine, but it doesn't work well with generators.

Both work for the original example input (and for other inputs). What input list are you using?

Sorry, you are right, but using L and lines with different indices is a bit confusing. Also, L[n] is easier to read than L[n] != '', and not L[n+1] is easier to read than L[n+1] == '' (maybe you want to roll back my changes).