Splitting a paragraph in Python using a regular expression when the text contains abbreviations


python regex


I am trying to use this function on a paragraph that consists of three sentences and contains abbreviations.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split on multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

The first character at the beginning of the next sentence gets removed.

Output received:

While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious.

So even the abbreviations get split.

The regular expression you want is:

[.!?][\s]{1,2}(?=[A-Z])

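You need a positive lookahead assertion: it matches the pattern only when it is followed by a capital letter, but the capital letter itself is not part of the match, so split() does not consume it.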
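To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample text and the names consuming, lookahead and sample are made up for illustration and are not from the question.

```python
import re

# Pattern from the question: the capital letter is part of the match,
# so split() throws it away together with the punctuation and whitespace.
consuming = re.compile(r'[.!?]\s{1,2}[A-Z]')

# Pattern with a positive lookahead: the capital letter is only checked,
# never consumed, so it stays at the start of the next sentence.
lookahead = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')

# Made-up sample text containing an abbreviation followed by a lowercase word.
sample = "It rained. We stayed in (e.g. reading). Then it cleared!  Everyone left."

print(consuming.split(sample))
# ['It rained', 'e stayed in (e.g. reading)', 'hen it cleared', 'veryone left.']

print(lookahead.split(sample))
# ['It rained', 'We stayed in (e.g. reading)', 'Then it cleared', 'Everyone left.']
```

Note that in both cases the sentence-ending punctuation is still removed, because it is part of the match; only the capital letter is preserved by the lookahead.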

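The reason only the first sentence boundary is matched is that there is no space after the second period ("worldwide.In"), and the pattern requires one or two whitespace characters.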
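The effect of the missing space can be reproduced with a short made-up text. The looser pattern below (loose_enders, using \s{0,2}) is only an assumption about one possible workaround and is not part of the answer; it comes at a cost, because it also splits inside dotted abbreviations such as "U.S.A.".

```python
import re

# Pattern from the answer: needs one or two whitespace characters.
sentence_enders = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')

# Hypothetical looser variant: zero to two whitespace characters.
loose_enders = re.compile(r'[.!?]\s{0,2}(?=[A-Z])')

# Made-up text with a missing space after "worldwide." and an abbreviation.
text = ("Mangoes are grown widely. They ship worldwide.In many cultures "
        "they are decorations. See M. foetida too.")

strict_result = sentence_enders.split(text)
loose_result = loose_enders.split(text)

print(strict_result)
# ['Mangoes are grown widely',
#  'They ship worldwide.In many cultures they are decorations',
#  'See M. foetida too.']

print(loose_result)
# ['Mangoes are grown widely', 'They ship worldwide',
#  'In many cultures they are decorations', 'See M. foetida too.']

# Trade-off: the loose variant breaks dotted abbreviations apart.
abbreviation_result = loose_enders.split("Made in the U.S.A. today")
print(abbreviation_result)
# ['Made in the U', 'S', 'A. today']
```

If the input is under your control, adding the missing space to the text is the simpler fix.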

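Please only post short, self-contained, correct examples.

It is very clear what he got and what he wanted to get. He just didn't format it properly.

I didn't understand what he wanted until your answer.

What's with the [\s]? Shouldn't it just be \s?

I didn't edit it much. The two are equivalent.

@agf: Thanks a lot for the answer. It solved the problem :) Can you tell me if there is a way to fix this special character problem?

Try sentenceEnders.split(paragraph, re.UNICODE)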
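A note on that last suggestion, as a minimal sketch assuming Python 3: for a compiled pattern, the second positional argument of split() is maxsplit, so flags such as re.UNICODE are normally passed to re.compile() instead. The names PATTERN, text and enders are made up for illustration. The last two lines show one possible explanation for the "ΓÇô" characters, assuming the output was printed as UTF-8 bytes to a console using code page 437; that would be a display encoding issue rather than a regex issue.

```python
import re

PATTERN = r'[.!?]\s{1,2}(?=[A-Z])'

# 40 short made-up sentences, so the pattern matches 39 times.
text = " ".join("Sentence %d." % i for i in range(40))

# Flags belong in re.compile(). For str patterns Python 3 already matches
# in Unicode mode by default, so re.UNICODE changes nothing here.
enders = re.compile(PATTERN, re.UNICODE)
all_pieces = enders.split(text)

# Passed to split(), re.UNICODE (integer value 32) is taken as maxsplit
# and silently caps the number of splits at 32.
capped_pieces = enders.split(text, re.UNICODE)

print(len(all_pieces), len(capped_pieces))  # 40 33

# An en dash (U+2013) encoded as UTF-8 and decoded as code page 437
# yields the three characters seen in the question's output.
garbled = "\u2013".encode("utf-8").decode("cp437")
print(ascii(garbled))
```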
