How to split a paragraph into sentences in Python


python regex


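I need to parse the sentences out of a paragraph in Python. Is there an existing package that does this, or should I try to use regex here?

Here is how I get the first n sentences: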

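import itertools

def get_first_n_sentence(text, n):
    """Return the first n sentences of text, split naively on '.', '?' and '!'."""
    endsentence = ".?!"
    # groupby alternates between runs of ordinary characters (False)
    # and runs of sentence-ending punctuation (True).
    groups = itertools.groupby(text, lambda ch: ch in endsentence)
    first_n_sentences = ''
    count = 0
    for is_punct, chars in groups:
        first_n_sentences += ''.join(chars).replace('\n', ' ')
        if is_punct:
            count += 1  # one more sentence has been closed
            if count >= n:
                break
    return first_n_sentences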


NLTK's tokenize module is designed for exactly this and handles edge cases. For example:

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
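
To get back to the original question (the first n sentences), the tokenizer output can simply be sliced. This is only a minimal sketch: the function name `first_n_sentences` and its `split` parameter are hypothetical, introduced here for illustration, and the default path assumes NLTK and its Punkt sentence model data are installed.

```python
def first_n_sentences(text, n, split=None):
    """Return the first n sentences of text joined by single spaces.

    `split` is a hypothetical hook: any callable that maps a string to a
    list of sentences. When omitted, nltk's sent_tokenize is used.
    """
    if split is None:
        # Imported lazily so the function can be used with another
        # splitter when nltk (or its Punkt data) is not available.
        from nltk.tokenize import sent_tokenize
        split = sent_tokenize
    return ' '.join(split(text)[:n])
```

With NLTK available, `first_n_sentences(p, 1)` on the paragraph above should give the first element of the list shown, since it only slices what `sent_tokenize` returns.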

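Are there two spaces after the end of each sentence? Your problem statement doesn't give us enough information to work with.

A purely syntactic approach using regexps sounds problematic. Think of "Prof. Smith from the U.S. told us about 5.5 ways a period can be used."

This kind of job is usually done by a dedicated sentence-splitting tool or library module. Trying to use regular expressions alone won't produce good results; the better splitters are trained.

This does not work if the text contains URLs, or terms with punctuation (periods) in them, such as Ms., Dr., and so on.

for sentence in tokenize.sent_tokenize(text):
    print(sentence)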
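
To make the objection to a regex-only approach concrete, here is a small sketch of a naive splitter that breaks after `.`, `?` or `!` followed by whitespace. The helper name `naive_split` and the pattern are assumptions chosen for illustration, not something taken from the question or the answers.

```python
import re

def naive_split(text):
    # Split wherever sentence-ending punctuation is followed by whitespace.
    # The lookbehind keeps the punctuation attached to the preceding piece.
    return re.split(r'(?<=[.?!])\s+', text.strip())

p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
for piece in naive_split(p):
    print(repr(piece))
# 'Good morning Dr.'
# 'Adams.'
# 'The patient is waiting for you in room number 3.'
```

The abbreviation "Dr." is treated as the end of a sentence, so the two-sentence paragraph comes back as three pieces. A trained splitter is meant to avoid exactly this kind of mistake.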