Splitting text into sentences with a regular expression in Python


I am trying to split a sample text into a list of sentences, with no delimiter and no whitespace at the end of each sentence.

Sample text:

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?

Into this (desired output):
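['The first time you see The Second Renaissance it may look boring',
 'Look at it at least twice and definitely watch part 2',
 'It will change your view of the matrix',
 'Are the human people the ones who started the war',
 'Is AI a bad thing']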

My code is currently:

import re

def sent_tokenize(text):
    # split on sentence-ending punctuation, then strip surrounding spaces
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return sentences
However, this outputs (current output):
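['The first time you see The Second Renaissance it may look boring',
 'Look at it at least twice and definitely watch part 2',
 'It will change your view of the matrix',
 'Are the human people the ones who started the war',
 'Is AI a bad thing',
 '']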

Note the extra '' at the end.

Any ideas on how to remove the extra '' at the end of the current output?

nltk sent_tokenize

If you are in the business of NLP, I would strongly recommend the nltk package:

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[
    'The first time you see The Second Renaissance it may look boring.',
    'Look at it at least twice and definitely watch part 2.',
    'It will change your view of the matrix.',
    'Are the human people the ones who started the war?',
    'Is AI a bad thing?'
] 
It is much more robust than a regex, and provides plenty of options to get the job done. For more information, see the nltk documentation.
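(A minimal setup sketch, in case it helps: sent_tokenize relies on the pre-trained Punkt models, so a fresh nltk install may first need a one-time nltk.download('punkt'); the text variable here is just the sample paragraph from the question.)

import nltk
from nltk.tokenize import sent_tokenize

# one-time download of the Punkt sentence tokenizer models
nltk.download('punkt')

text = ("The first time you see The Second Renaissance it may look boring. "
        "Look at it at least twice and definitely watch part 2. "
        "It will change your view of the matrix. "
        "Are the human people the ones who started the war? "
        "Is AI a bad thing?")

print(sent_tokenize(text))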

If you are picky about the trailing delimiters, you can use nltk.tokenize.RegexpTokenizer with a slightly different pattern:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'[^.?!]+')
>>> list(map(str.strip, tokenizer.tokenize(text)))    
[
    'The first time you see The Second Renaissance it may look boring',
    'Look at it at least twice and definitely watch part 2',
    'It will change your view of the matrix',
    'Are the human people the ones who started the war',
    'Is AI a bad thing'
]
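(A side note of my own, not part of the original answer: RegexpTokenizer can also be told that the pattern describes the separators rather than the tokens, via gaps=True, which behaves much like re.split but discards empty strings by default. The pattern [.?!]\s* below is my own choice; it also swallows the space after each delimiter, so no strip is needed.)

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'[.?!]\s*', gaps=True)
>>> tokenizer.tokenize(text)
[
    'The first time you see The Second Renaissance it may look boring',
    'Look at it at least twice and definitely watch part 2',
    'It will change your view of the matrix',
    'Are the human people the ones who started the war',
    'Is AI a bad thing'
]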

Regex-based re.split

If you must use a regex, you will need to modify your pattern by adding a negative lookahead -

>>> list(map(str.strip, re.split(r"[.!?](?!$)", text)))
[
    'The first time you see The Second Renaissance it may look boring',
    'Look at it at least twice and definitely watch part 2',
    'It will change your view of the matrix',
    'Are the human people the ones who started the war',
    'Is AI a bad thing?'
]

The added (?!$) specifies that we split only when we have not yet reached the end of the line. Unfortunately, I am not sure the trailing delimiter on the last sentence can reasonably be removed without doing something like result[-1] = result[-1][:-1].
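(One more hedged variation of my own, not from the original answers: instead of splitting on the delimiters, you can match everything that is not a delimiter with re.findall, which avoids both the trailing '?' and the empty string.)

>>> [s.strip() for s in re.findall(r"[^.!?]+", text)]
[
    'The first time you see The Second Renaissance it may look boring',
    'Look at it at least twice and definitely watch part 2',
    'It will change your view of the matrix',
    'Are the human people the ones who started the war',
    'Is AI a bad thing'
]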

You can use filter to remove the empty elements. Ex:

import re

text = """The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"""

def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    # filter(None, ...) drops the empty string left after the final delimiter
    return list(filter(None, sentences))

print(sent_tokenize(text))

You can either strip the paragraph first, before splitting it, or filter out the empty strings in the result.

Any ideas on how to remove the extra '' at the end of my current output?

You can remove it by doing the following:

sentences[:-1]
Or faster (courtesy of ᴄᴏʟᴅsᴘᴇᴇᴅ):

del result[-1]

Output:

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
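(Put together with the original function, a small sketch of this answer's approach; note it assumes the text always ends with one of . ! ?)

import re

def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    # drop the empty string left behind by the final delimiter
    return sentences[:-1]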

Any reason not to use nltk.sent_tokenize?

You will probably have an extra space at the end. Just check it. You can fix that by running .strip() first.

@DyZ Great minds (see my answer) :)

@cᴏʟᴅsᴘᴇᴇᴅ I know :) "no delimiters". See the desired output.

@ᴡʜᴀᴄᴋᴀᴍᴀᴅᴏᴏᴅʟᴇ that is a minor detail. I'll see what I can do though.

@ᴡʜᴀᴄᴋᴀᴍᴀᴅᴏᴏᴅʟᴇ3000 I have added an option with RegexpTokenizer to address that. Hope everything is fine now!

I have to use regex; I don't have access to the nltk package. The regex answer works, but it leaves a '?' at the end of the last sentence.
(?… is a lookbehind (see the explanation for option 2). You may want to use a lookbehind here rather than (?= or (?!$), since $ is a zero-width assertion.