Splitting a string in Python using a regex with a positive lookbehind

Tags: python, regex, string, regex-lookarounds

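To address one of the comments: my overall goal is to understand how to write a regex that lets me use a word boundary inside a positive or negative lookbehind, since it appears you cannot use quantifiers there.

For my specific case, I want to check that the word before a period ('.') is not a capitalized word. As I see it, I can approach this in two different ways:

1) A positive lookbehind asserting that the word before the '.' is all lowercase. However, the error I receive says that positive lookbehinds are zero-width, so I cannot use the quantifier '+' like this: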
(?)?
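I am more inclined toward some modification of option 1, because it makes more sense to me, although I am open to other suggestions. Can I use a word boundary here?

I am using this to split a paragraph into sentences, and I would like to use a regex rather than nltk. The problem lies mainly in handling initials and abbreviated names.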
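The restriction described above can be checked directly. The following is a minimal sketch, not taken from the question: the helper name `compiles` and the three sample patterns are illustrative assumptions. It shows that Python's `re` module rejects a lookbehind whose width can vary, accepts one of fixed width, and places no such limit on a pattern that matches the word itself.

```python
import re

def compiles(pattern: str) -> bool:
    """Hypothetical helper: return True if the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A quantifier inside a lookbehind makes its width variable, which re rejects
# with "look-behind requires fixed-width pattern".
variable_width = r"(?<=\b[a-z]+)\."
# A lookbehind that always spans exactly two characters is accepted.
fixed_width = r"(?<=[a-z]{2})\."
# Matching the word itself, outside any lookbehind, allows quantifiers freely.
no_lookbehind = r"\b[a-z]+\."

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
print(compiles(no_lookbehind))   # True
```

The third form is why matching the lowercase word and keeping it (with a capture group) is a common way around the lookbehind limit.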

Current regex:

(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)
Desired output:

Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
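To see where the current regex falls short, here is a small sketch. The input string is an assumption: the question's original text is not shown, so `TEXT` is simply the desired output lines joined by single spaces. The two-character lookbehind only inspects the last two characters before the period, so "Jon." passes the check ('o' is not uppercase, 'n' is lowercase), while "C." is correctly skipped.

```python
import re

# Assumed input: the desired output lines joined by single spaces.
TEXT = (
    "Koehler rides the bus. "
    "Bowman was passed into the first grade; Koehler advanced to third grade. "
    "Jon. Williams walked down the road to school. "
    "Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# The regex from the question, unchanged.
CURRENT = r"(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)"

# re.split consumes the matched period, so the pieces lose their final dot.
parts = re.split(CURRENT, TEXT)
for part in parts:
    print(repr(part))
```

With this input the split yields six pieces instead of five, because "Jon" is cut off from "Williams walked down the road to school".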

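For your particular case, I suggest using re.sub. It simplifies your regex considerably, and you no longer need lookbehinds, which come with several limitations (they must be fixed-width, among others).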

Code

import re
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))  # pass flags by keyword: the 4th positional parameter is count

Output

Koehler rides the bus. 
Bowman was passed into the first grade; Koehler advanced to third grade. 
Jon. Williams walked down the road to school. 
Bowman decided to go fishing; Koehler did not. 
C. Robinson asked to go to recess, and the teacher said no.

Regex details

(         # first capture group
\b        # word boundary
[a-z]+    # lower case a-z
\.        # literal period
\s*       # any other whitespace characters (added for cosmetic effect)
(?!$)     # negative lookahead - don't insert a newline at the end of the text (or of a line, with re.M)
)
The match is replaced with:

\1        # reference to the first capture group 
\n        # a newline
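For a self-contained run of this substitution idea, the sketch below may help. It makes several assumptions: `TEXT` is reconstructed from the desired output (the original input is not shown), `split_sentences` is a hypothetical helper name, and the whitespace is moved outside the capture group so the resulting lines carry no trailing space, which is a small variation on the pattern above.

```python
import re

# Assumed input: the desired output lines joined by single spaces.
TEXT = (
    "Koehler rides the bus. "
    "Bowman was passed into the first grade; Koehler advanced to third grade. "
    "Jon. Williams walked down the road to school. "
    "Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Lowercase word + period (captured), then any whitespace, but not at the end.
BOUNDARY = re.compile(r"(\b[a-z]+\.)\s*(?!$)")

def split_sentences(text):
    """Hypothetical helper: mark each boundary with a newline, then split on it."""
    marked = BOUNDARY.sub(r"\1\n", text.strip())
    return marked.split("\n")

for sentence in split_sentences(TEXT):
    print(sentence)
```

Note that the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is interpreted as count. This sketch needs no flags because the assumed input is a single line; text that already contains newlines would also be split at those.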


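If you want to build a list of sentences, here is another option:

# Split into sentences (the last word of each sentence is split off too)
temp = re.split(r'( [a-z]+\.)', text)
temp = list(filter(bool, temp))  # list() so that temp supports len() and indexing in Python 3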

['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']

# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]

['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
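The split-and-rejoin steps above can be wrapped into one function. This is a sketch under assumptions: `sentences_by_split` is a hypothetical name and `TEXT` is reconstructed from the desired output. Instead of filtering out empty strings, it relies on the fact that re.split with a single capture group always returns an alternating list (body, delimiter, body, ...), so the pieces can be paired by slicing without the pairs ever drifting out of step.

```python
import re

# Assumed input: the desired output lines joined by single spaces.
TEXT = (
    "Koehler rides the bus. "
    "Bowman was passed into the first grade; Koehler advanced to third grade. "
    "Jon. Williams walked down the road to school. "
    "Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

def sentences_by_split(text):
    """Hypothetical helper: split on ' word.' and glue each piece back together."""
    # The capture group makes re.split keep every " word." delimiter.
    pieces = re.split(r"( [a-z]+\.)", text)
    # Even indices hold sentence bodies, odd indices hold the delimiters.
    bodies, delimiters = pieces[0::2], pieces[1::2]
    sentences = [(body + end).strip() for body, end in zip(bodies, delimiters)]
    # The last body has no delimiter; keep it only if it contains text.
    tail = bodies[len(delimiters):]
    sentences.extend(t.strip() for t in tail if t.strip())
    return sentences

print(sentences_by_split(TEXT))
```

Pairing by slices rather than by a filtered list is a design choice: dropping an empty piece from the middle of the list would shift every later pair by one.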

Try the regex explained below.

If the text spans multiple lines, use the following:

lst = re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)', mystr, re.M)
Both produce:

['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced', 'to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']
Explanation of
.+?\b(?![A-Z])\w+\.

.+?       #as few characters as possible after the end of the previous match, so the text is cut into as many separate sentences as possible
\b        #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+       #the whole word
\.        #followed by a dot
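The following sketch runs the findall approach end to end. The input is an assumption: `WRAPPED` is reconstructed from the output shown above, with a line break after "advanced". It also shows one possible alternative that is not part of the answer: collapsing all whitespace first and then applying the single-line pattern from the explanation, so a wrapped sentence is not returned in two pieces.

```python
import re

# Assumed input, reconstructed from the output above (note the line break).
WRAPPED = (
    "Koehler rides the bus. "
    "Bowman was passed into the first grade; Koehler advanced\n"
    "to third grade. "
    "Jon. Williams walked down the road to school. "
    "Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Multi-line variant from the answer: a match ends at a line end or at a
# non-capitalized word followed by a period.
per_line = re.findall(r".+?(?:$|\b(?![A-Z])\w+\b\.)", WRAPPED, re.M)

# Alternative sketch: normalize whitespace, then use the single-line pattern.
flat = " ".join(WRAPPED.split())
sentences = [s.strip() for s in re.findall(r".+?\b(?![A-Z])\w+\.", flat)]

print(per_line)
print(sentences)
```

The first list keeps the leading spaces and breaks the wrapped sentence at the newline, as in the output above; the second yields the five sentences from the question's desired output.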


You spend one sentence explaining what the goal is, and the rest of the page discussing your approach. It would be better to focus more on describing the problem. From your description, you may be able to do this without a regex (or with a simpler one) instead.

Mostly I wanted to show that I have put some effort into solving the problem at hand. There are other issues unrelated to this question, such as other splitting options, but initials are giving me the most trouble at the moment.

@RickAhlf - Whoops. Fixed.