Splitting a string in Python using a regex with a positive lookbehind
To address one of the comments: my overall goal is to understand how to write a regular expression that lets me use a word boundary inside a positive or negative lookbehind, since it seems you can't use quantifiers there. For my specific case, I want to check that the word before a period ('.') is not a capitalized word. In my mind, I can approach this from two different directions:
1) A positive lookbehind asserting that the word before the '.' is all lowercase. However, the error I get says that the lookbehind must be fixed-width, so I can't use the '+' quantifier inside it.
I lean toward some variation of option 1, since it makes more sense to me, although I'm open to other suggestions. Can I use a word boundary here?
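A minimal sketch of the limitation described above. The pattern `(?<=[a-z]+)\.` is an assumed stand-in for the lookbehind that fails, and `compiles` is a hypothetical helper introduced only for this illustration. Python's built-in `re` module rejects a lookbehind whose width can vary, while a fixed-width one compiles fine:

```python
import re

def compiles(pattern):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Assumed example patterns (not taken verbatim from the question).
variable_width = r'(?<=[a-z]+)\.'   # '+' makes the lookbehind variable-width
fixed_width = r'(?<=[a-z])\.'       # exactly one character: allowed

print(compiles(variable_width))
print(compiles(fixed_width))
```

A fixed-width lookbehind can only inspect a set number of characters, so it cannot tell whether the whole preceding word is lowercase, which is why the answers below avoid lookbehinds.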
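I'm using this to split a paragraph into sentences, and I'd like to do it with a regex rather than nltk. The problem is mainly in handling initials or abbreviations of names.
Current regex: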
(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)
Desired output:
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
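To see concretely where the current regex falls short, here is a small sketch. The sample `text` is an assumption, reconstructed by joining the desired output lines with single spaces. The two-character lookbehind accepts the "on" in "Jon", so the abbreviation is split off as its own piece (and the matched periods are consumed by the split):

```python
import re

# Assumed input: the desired output lines joined into one paragraph.
text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. "
        "C. Robinson asked to go to recess, and the teacher said no.")

current = r'(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)'
pieces = re.split(current, text)

for piece in pieces:
    print(repr(piece))
```

The initial "C." is left alone because the lookbehind needs a lowercase letter directly before the period, but "Jon." is still split from "Williams".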
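For your particular case, I'd suggest using re.sub. Your regex becomes much simpler this way, and you don't need lookbehinds, which come with many restrictions (they must be fixed-width, and so on).
Code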
import re
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))  # pass flags by keyword: the 4th positional argument of re.sub is count
Output
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
Regex details
( # first capture group
\b # word boundary
[a-z]+ # lower case a-z
\. # literal period
\s* # any other whitespace characters (added for cosmetic effect)
(?!$) # negative lookahead - don't insert a newline when you're at the end of the text (or of a line, with re.M)
)
This pattern is replaced with:
\1 # reference to the first capture group
\n # a newline
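Put together as a self-contained sketch: the sample `text` is an assumption (the desired output lines joined with single spaces), and turning the result into a list via `split('\n')` is an addition for illustration, not part of the answer above.

```python
import re

# Assumed input: the desired output lines joined into one paragraph.
text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. "
        "C. Robinson asked to go to recess, and the teacher said no.")

# A lowercase word followed by a period, plus trailing whitespace,
# but not at the very end of the text.
pattern = re.compile(r'(\b[a-z]+\.\s*(?!$))', flags=re.M)

# Append a newline after every sentence-ending match.
marked = pattern.sub(r'\1\n', text)

# Optional: turn the newline-separated result into a list of sentences.
sentences = [line.strip() for line in marked.split('\n')]
print(sentences)
```

"Jon." and "C." are skipped because `\b[a-z]+` must start at a word boundary, and inside "Jon" there is no boundary before the lowercase letters.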
If you want to create a list of sentences, here is another option:
# Split into sentences (last word is split off too)
temp = re.split(r'( [a-z]+\.)', text)
temp = list(filter(bool, temp))  # list() is needed in Python 3, where filter returns an iterator
['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']
# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]
['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
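The two steps above can also be wrapped into one function. This is a sketch under assumptions: `split_sentences` is a hypothetical name, the sample `text` is reconstructed from the desired output, and the pairing uses `zip` over slices instead of index arithmetic. It relies on the fact that `re.split` with a single capture group returns alternating text and separator pieces.

```python
import re

def split_sentences(text):
    """Split on ' <lowercase word>.' and glue each separator back on."""
    # With one capture group, re.split returns
    # [text0, sep1, text1, sep2, ..., textN] (always an odd length).
    parts = re.split(r'( [a-z]+\.)', text)
    bodies = parts[0::2]
    tails = parts[1::2] + ['']  # pad so the last body has a partner
    joined = [(body + tail).strip() for body, tail in zip(bodies, tails)]
    return [s for s in joined if s]  # drop the empty trailing piece

# Assumed input: the desired output lines joined into one paragraph.
text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. "
        "C. Robinson asked to go to recess, and the teacher said no.")

sentences = split_sentences(text)
print(sentences)
```

Padding the separators rather than filtering empty strings first keeps the pairs aligned even when the text does not end with a matched sentence.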
Try it.
If the text spans multiple lines, use the following:
lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)
Both variants produce
['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced', 'to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']
Explanation of '.+?\b(?![A-Z])\w+\.'
.+? #As few characters as possible after the end of the previous match, so that we get as many distinct sentences as possible
\b #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+ #the whole word
\. #followed by a dot
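A runnable sketch of the single-line pattern explained above. The sample `text` is an assumption (the desired output lines joined with single spaces), and the `strip()` call is added here only to drop the leading space that each match after the first carries.

```python
import re

# Assumed input: the desired output lines joined into one paragraph.
text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. "
        "C. Robinson asked to go to recess, and the teacher said no.")

# Lazily consume characters until a word that does not start with an
# uppercase letter is immediately followed by a period.
pattern = r'.+?\b(?![A-Z])\w+\.'

sentences = [match.strip() for match in re.findall(pattern, text)]
print(sentences)
```

Note that `(?![A-Z])` only checks the first character of the word, and `\w` also matches digits, so a number followed by a period would end a sentence under this pattern.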
You used one sentence to explain what the goal is, and the rest of the page discusses your approach. It would be better to focus more on describing the problem. From your description, you may be able to do this without a regex (or with a simpler one).
On the contrary, I mainly wanted to show that I've put some effort into trying to solve the problem at hand. There are also some issues unrelated to this question, such as other splitting options, but name initials are what is giving me the most trouble right now.
@RickAhlf - Whoops. Fixed.