Splitting a string in Python using a regex with a positive lookbehind

python regex string


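To address one of the comments: my overall goal is to understand how to write a regex that lets me use a word boundary inside a positive or negative lookbehind, since it seems you cannot use quantifiers there.

For my specific case, I want to check that the word before a period ('.') is not a capitalized word. As I see it, there are two ways to approach this:

1) A positive lookbehind asserting that the word before the '.' is all lowercase. However, Python's re module rejects a lookbehind that contains the quantifier '+' (it reports that look-behind requires a fixed-width pattern), so I cannot write it like this: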

（？）？
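I would prefer some variation of option 1, since it makes more sense to me, though I am open to other suggestions. Can I use a word boundary here?
I am using this to split a paragraph into sentences, and I would like to do it with a regex rather than nltk. The main problem is handling initials and abbreviated names.
Current regex: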
(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)

Expected output:
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
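
To make the problem concrete, here is a minimal sketch of both symptoms. The `text` value is an assumption: the input paragraph is rebuilt by joining the expected sentences above with single spaces. The first pattern, `(?<=\b[a-z]+)\.`, is only an illustration of a lookbehind that contains a quantifier, not necessarily the exact pattern from the question.

```python
import re

# Assumption: the input paragraph, rebuilt from the expected output above.
text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# A quantifier inside a lookbehind makes its width variable,
# which the standard-library re module refuses to compile.
try:
    re.compile(r'(?<=\b[a-z]+)\.')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)

# The current regex compiles because its lookbehind is exactly two
# characters wide, but "on" in "Jon" satisfies [^A-Z][a-z], so the
# text is also split after "Jon".
parts = re.split(r'(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)', text)
print(len(parts))
print(repr(parts[2]))
```

A two-character lookbehind cannot see the start of the word, which is why the capital "J" goes unnoticed. Note also that `re.split` consumes the period itself here, since the period is the matched delimiter.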

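For your particular case, I suggest using re.sub. Your regex becomes much simpler this way, and you don't need lookbehinds, which come with many restrictions (they must be fixed-width, and so on).
Code
import re
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))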


Output
Koehler rides the bus. 
Bowman was passed into the first grade; Koehler advanced to third grade. 
Jon. Williams walked down the road to school. 
Bowman decided to go fishing; Koehler did not. 
C. Robinson asked to go to recess, and the teacher said no.


Regex details
(         # first capture group
\b        # word boundary
[a-z]+    # lower case a-z
\.        # literal period
\s*       # any other whitespace characters (added for cosmetic effect)
(?!$)     # negative lookahead - don't insert a newline when you're at the end of the text
)

This pattern is replaced with:
\1        # reference to the first capture group 
\n        # a newline
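
As a rough, self-contained sketch of this substitution approach, the snippet below turns the result into a list of sentences. The `text` value is an assumption (the paragraph rebuilt from the expected output in the question), and the `marked` and `sentences` names are introduced here only for illustration.

```python
import re

# Assumption: the input paragraph, rebuilt from the question's expected output.
text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Append a newline after every lowercase word that ends with a period,
# except when nothing follows it. flags must be passed by keyword,
# because the fourth positional argument of re.sub is count.
marked = re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M)

# Split on the inserted newlines and drop the trailing spaces
# that \s* kept inside the capture group.
sentences = [line.strip() for line in marked.splitlines()]
for sentence in sentences:
    print(sentence)
```

"Jon." and "C." are left alone because `\b[a-z]+` has to start at a word boundary, and the only boundary in those words sits in front of the capital letter.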


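If you want to build a list of sentences, here is another option:
# Split into sentences (the last word of each sentence is split off too)
temp = re.split(r'( [a-z]+\.)', text)
# Drop empty pieces; a list (not a lazy filter object) is needed for len() and indexing below
temp = list(filter(bool, temp))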

['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']

# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]

['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
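
The split-and-rejoin idea above can also be sketched without filtering, which may be a little more robust when the text does not end with a lowercase word and a period. This is a variation for illustration, not the answer's exact code: `split_sentences` is a hypothetical helper name, and `text` is assumed to be the paragraph rebuilt from the question's expected output.

```python
import re

def split_sentences(text):
    """Split text after each lowercase word that is followed by a period."""
    # With a capture group, re.split returns an odd-length list:
    # text, delimiter, text, delimiter, ..., trailing text.
    parts = re.split(r'( [a-z]+\.)', text)
    # Glue each text piece back onto the delimiter that followed it.
    sentences = [(head + delim).strip()
                 for head, delim in zip(parts[0::2], parts[1::2])]
    # Keep whatever is left after the last delimiter, if anything.
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences

# Assumption: the input paragraph, rebuilt from the question's expected output.
text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

Pairing by position instead of filtering out empty strings keeps text pieces and delimiters aligned even when one of the text pieces happens to be empty.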


Try this
If the text spans multiple lines, use the following:
lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)

Both of them produce
['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced', 'to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']

Explanation of '.+?\b(?![A-Z])\w+\.'

.+?       #as few characters as possible after the end of the previous match, so that we get as many distinct sentences as possible
\b        #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+       #the whole word
\.        #followed by a dot
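
A small sketch of the pattern explained above, applied with `re.findall` to a single-line input. Both the `text` value (the paragraph rebuilt from the question's expected output) and the exact `findall` call are assumptions made for illustration; only the pattern itself comes from the explanation.

```python
import re

# Assumption: the input paragraph on a single line,
# rebuilt from the question's expected output.
text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Each match runs from the end of the previous one up to the first
# word that does not start with a capital letter and ends with a period.
matches = re.findall(r'.+?\b(?![A-Z])\w+\.', text)

# Every match after the first starts with the space that separated
# the sentences, so strip it.
sentences = [m.strip() for m in matches]
for sentence in sentences:
    print(sentence)
```

Because the lookahead only inspects the first letter of the final word, a lowercase abbreviation followed by a period would still end a match, so this remains a heuristic rather than a general sentence splitter.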

You spent one sentence explaining what the goal is, and the rest of the page discusses your approach. It would be better to focus more on describing the problem. From your description, you may be able to do this without a regex (or with a simpler one) instead.
Mainly I wanted to show that I had put some effort into solving the problem at hand. There are other issues unrelated to this question, such as other splitting options, but name initials are giving me the most trouble right now.
@RickAhlf - Whoops. Fixed.