Python: extracting sentences, off by one
Tags: python, regex, python-2.7, python-3.x

I want to extract the sentences in a paragraph and print them one per line. This works well, except when a period is followed by a newline. For example, the user finishes a sentence and then presses Enter, so there is no space after the period. The code treats that sentence as part of the previous one because there is no space, and the two come out stuck together when printed. In other words, how do I modify the code so that it extracts sentences when there is no space after the period? For example, This.should.be.consummed.five.sentences should count as five sentences because it contains five periods, but the code treats it as a single sentence.

The code is as follows:
import re

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
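The behaviour described in the question can be reproduced in isolation. The short sketch below uses a made-up two-sentence string (`text` and `after_period` are illustrative names, not part of the code above). It looks at the character right after the first period, which is the character that the `== ' '` comparison in `find_sentence_end` inspects.

```python
# Illustrative input: the first sentence is followed by a newline, not a space.
text = "First sentence.\nSecond sentence."

# Index of the character immediately after the first period.
after_period = text.index(".") + 1

# The character there is a newline ...
print(repr(text[after_period]))

# ... so a comparison against a literal space fails, and the period is
# not accepted as a sentence boundary.
print(text[after_period] == ' ')
```

Because the comparison is against a literal space only, a period followed by a newline is rejected as a sentence end, which is why the two sentences come out joined.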
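Splitting the string on "." is simple, but I doubt you want "123.333.0.0" to be treated as four sentences.

I think you just need to treat "\n" the same way as " ". That is easily done by matching "\s" (any whitespace) instead of " ".

Why should This.should.be.five.sentences count as five sentences rather than six? Even without a final period, the code explicitly treats the end of the paragraph as the end of a sentence. Do you want to remove that handling?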
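As a rough illustration of the "\s" suggestion, here is a minimal sketch that splits on any run of whitespace following a terminator. It is not the asker's function: `split_sentences` and `SENTENCE_BOUNDARY` are hypothetical names. It also deliberately ignores the abbreviation and wrapper handling from the original code, so "dr. Smith" would be split and a terminator followed by a closing quote would not be.

```python
import re

# Hypothetical pattern: one or more whitespace characters (space, tab,
# newline, ...) that come directly after '.', '!' or '?'.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')


def split_sentences(paragraph):
    """Split a paragraph wherever a terminator is followed by whitespace."""
    parts = SENTENCE_BOUNDARY.split(paragraph.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]


text = "First sentence.\nSecond sentence. Third one!"
for sentence in split_sentences(text):
    print(sentence)
```

Because a boundary still requires whitespace after the terminator, a string such as "123.333.0.0" stays in one piece. In the asker's own code, the analogous change would presumably be to replace the `paragraph[sum(pe)] == ' '` comparison with `paragraph[sum(pe)].isspace()`, though that variant is not tested here.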