Python: trying to split sentences with a regular expression


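So far, I have found that this regex works well for splitting text into sentences in almost all the competitions I have taken part in:

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?])\s

Unfortunately, there is one case it does not cover. For example, if I have text like this: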

C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well.
The regex splits it into three sentences:

C.
Daniel, who love cakes, is taking a trip to Nevada.
Not gonna lie, i would go as well.
instead of:

C. Daniel, who love cakes, is taking a trip to Nevada.
Not gonna lie, i would go as well.
What is missing is this special case: when the match is preceded by a single uppercase letter followed by a dot (.), we should not split.
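
The behaviour described in the question can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are made up for illustration, and it assumes the pattern is applied with re.split, which the question does not state explicitly.

```python
import re

# The pattern from the question, copied unchanged.
PATTERN = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?])\s")

text = ("C. Daniel, who love cakes, is taking a trip to Nevada. "
        "Not gonna lie, i would go as well.")

# Split on every whitespace character the pattern accepts as a sentence break.
parts = PATTERN.split(text)

for part in parts:
    print(part)
```

Neither negative lookbehind can match in front of the first space, because only the two characters "C." precede it, so the initial is cut off as a sentence of its own.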


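I still do not really know how to use regular expressions properly, so I would be very grateful if you could explain why your answer works.

You can extend the pattern by adding a negative lookbehind

(?<!\b[A-Z]\.)

which asserts that what is directly to the left is not an uppercase character followed by a dot.

I think you can also omit the dot after \w\.\w, since an unescaped dot matches any character except a newline. Be aware, though, that without that last character the lookbehind can never match directly in front of (?<=[.?]), so abbreviations such as e.g. are no longer protected; escaping the dot as \w\.\w\. keeps that check.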

(?<!\b[A-Z]\.)(?<!\w\.\w)(?<![A-Z][a-z]\.)(?<=[.?])\s
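
To check the extended pattern, here is a small sketch using re.split. The helper name split_sentences, the sample string abbrev, and the ESCAPED variant are assumptions added for illustration; only EXTENDED is taken from the answer.

```python
import re

# Pattern from the answer above, copied verbatim.
EXTENDED = re.compile(
    r"(?<!\b[A-Z]\.)(?<!\w\.\w)(?<![A-Z][a-z]\.)(?<=[.?])\s"
)

# Hypothetical variant (not from the answer): keep the question's second
# lookbehind but escape its last dot, so it only matches a literal dot.
ESCAPED = re.compile(
    r"(?<!\b[A-Z]\.)(?<!\w\.\w\.)(?<![A-Z][a-z]\.)(?<=[.?])\s"
)


def split_sentences(text, pattern=EXTENDED):
    """Split text on whitespace that the pattern accepts as a sentence break."""
    return pattern.split(text)


names = ("C. Daniel, who love cakes, is taking a trip to Nevada. "
         "Not gonna lie, i would go as well.")
abbrev = "Fruit, e.g. apples, is healthy. Pears too."

print(split_sentences(names))            # the initial "C." stays attached
print(split_sentences(abbrev))           # breaks after "e.g."
print(split_sentences(abbrev, ESCAPED))  # "e.g." stays inside its sentence
```

With the pattern from the answer, the sample from the question yields two sentences. The abbrev string is split after "e.g." by EXTENDED but not by the ESCAPED variant, which is why the choice between dropping and escaping that dot matters.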

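If you would rather have a solution that is not based on regular expressions, you can use nltk here:

import nltk

# sent_tokenize needs the Punkt tokenizer data, which is downloaded once,
# e.g. with nltk.download('punkt') ('punkt_tab' on recent NLTK releases).

txt_1 = "C. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."

nltk.sent_tokenize(txt_1)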

['C. Daniel, who love cakes, is taking a trip to Nevada.',
 'Not gonna lie, i would go as well.']

txt_2 = "Mr. Daniel, who love cakes, is taking a trip to Nevada. Not gonna lie, i would go as well."
nltk.sent_tokenize(txt_2)

['Mr. Daniel, who love cakes, is taking a trip to Nevada.',
 'Not gonna lie, i would go as well.']

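It seems to work! Can you tell me why you also added a \b?

@costabrava With the \b, only a single uppercase letter followed by a dot is protected, so the pattern can still split when a sentence ends with two uppercase characters.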
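
The effect of the \b can be seen by comparing the lookbehind with and without it. This is a sketch under assumptions: the two reduced patterns and the sample sentences below are made up for illustration and are not part of the answer.

```python
import re

# Reduced patterns that isolate the lookbehind under discussion.
WITH_BOUNDARY = re.compile(r"(?<!\b[A-Z]\.)(?<=[.?])\s")
WITHOUT_BOUNDARY = re.compile(r"(?<![A-Z]\.)(?<=[.?])\s")

acronym = "She moved to the USA. He stayed home."
initial = "C. Daniel left. He is back."

# With \b, "A." in "USA." is not a standalone letter, so the split happens.
print(WITH_BOUNDARY.split(acronym))
# Without \b, any uppercase letter before the dot blocks the split.
print(WITHOUT_BOUNDARY.split(acronym))
# Both versions keep the single-letter initial attached to the name.
print(WITH_BOUNDARY.split(initial))
```

The word boundary requires the uppercase letter to start a word, so a single initial such as "C." is protected while an acronym at the end of a sentence still counts as a sentence break.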