Python - Extracting sentences - off by one

python regex python-2.7 python-3.x

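I want to extract the sentences in a paragraph and print them line by line. The code does a good job, except when a period is followed by a newline. For example, the user finishes a sentence and then hits the Enter key, so there is no space after the period. The code then treats that sentence as part of the previous one, and the two come out stuck together when printed.

In other words, how do I modify the code so that it extracts sentences when there is no space after the period? For example, This.should.be.consummed.five.sentences should count as five sentences because it contains five periods, but the code sees it as only one sentence.

The code is as follows: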

import re

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences

def find_sentence_end(paragraph):
    possible_endings, contraction_locations = [], []
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    # Record every terminator occurrence as [start index, terminator length].
    for sentence_terminator in sentence_terminators:
        possible_endings.extend([i, len(sentence_terminator)] for i in find_all(paragraph, sentence_terminator))
    # Record the index just past each abbreviation so its period is not taken as an ending.
    for contraction in contractions:
        contraction_locations.extend(i + len(contraction) for i in find_all(paragraph, contraction))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    # Skip the terminator that closes the paragraph itself.
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max(pe[0] for pe in possible_endings)
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    # Keep only endings followed by a literal space character.
    possible_endings = [sum(pe) for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    return max(possible_endings) if possible_endings else -1

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

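Splitting the string on "." is simple, but I doubt you want "123.333.0.0" to be treated as 4 sentences.

I think you just need to treat "\n" the same way as " ". That is easily done by matching "\s" (any whitespace) instead of " ".

Why should This.should.be.five.sentences count as five sentences rather than six? The code explicitly treats the end of the paragraph as a sentence even when there is no period. Do you want to get rid of that handling?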
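To illustrate the "\s" suggestion, here is a minimal sketch rather than a patch to the code in the question. Both function names are hypothetical and do not appear in the original code. The first shows the kind of check that could stand in for the comparison with ' ', and the second is a standalone regex splitter that breaks after ".", "!" or "?" only when whitespace follows. The sketch assumes abbreviations and closing wrappers are handled elsewhere, so it would still split after "dr. ".

```python
import re

# Hypothetical helper (not in the original code): True when the character
# right after a candidate ending at index `end` is any whitespace,
# including a newline.
def followed_by_whitespace(text, end):
    return end < len(text) and text[end].isspace()

# Hypothetical splitter: split after '.', '!' or '?' whenever one or more
# whitespace characters (space, tab, newline) follow the terminator.
def split_on_whitespace_after_terminator(text):
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sample = "First sentence.\nSecond sentence. Third one!"

# Index 15 holds the newline after the first period.
print(sample[15] == ' ')                   # comparison with a literal space
print(followed_by_whitespace(sample, 15))  # whitespace-aware check

for sentence in split_on_whitespace_after_terminator(sample):
    print(sentence)

# Periods that are not followed by whitespace do not cause a split.
print(split_on_whitespace_after_terminator("The address is 123.333.0.0 today."))
```

Because a split needs whitespace after the terminator, a string such as "123.333.0.0" stays in one piece, which matches the concern raised in the first comment. By the same rule, the input This.should.be.five.sentences would still come back as a single sentence.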