Python: Extracting sentences from a line - regex needed based on criteria


python regex


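Bit of a Python/programming newbie here.

I am trying to come up with a regex that extracts the sentences from a line of a text file and appends them to a list. The code: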

import re

txt_list = []

with open('sample.txt', 'r') as txt:
    patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'
    read_txt = txt.readlines()

    for line in read_txt:
        if line == "\n":
            txt_list.append("\n")
        else: 
            found = re.findall(patt, line)
            for f in found:
                txt_list.append(f)


for line in txt_list:
    if line == "\n":
        print("newline")
    else:
        print(line)

Output, as printed by the last 5 lines of the code above:

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! 
What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
I am the {very last|last} sentence for this {instance|example}.

Contents of 'sample.txt':

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

I am the {very last|last} sentence for this {instance|example}.

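I've been playing with the regex for a couple of hours now and I can't seem to crack it. As it stands, the regex does not match at the end of "for lunch?", so these two sentences

What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

are not split apart, and splitting them is what I want.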
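To make the failure concrete, here is a minimal sketch (Python 3, with the first line of sample.txt inlined as a string instead of being read from disk) that runs the pattern from the code above. Because `.*` is greedy, each alternative reaches for the last usable `}` on the line, so the second match swallows both remaining sentences.

```python
import re

# The pattern from the question, unchanged.
patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'

# First line of sample.txt, inlined here for illustration.
line = (
    "{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! "
    "What {will|shall|should} we {eat|have} for lunch? "
    "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
    "{that|is} what he said.\n"
)

found = re.findall(patt, line)

# repr() makes trailing spaces and newlines visible.
for piece in found:
    print(repr(piece))
```

Only two pieces come back: the first alternative stops at `fellow}!` because that is the only `}` directly followed by a terminator, and the second alternative then runs from "What" to the end of the line.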

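Some important details for the regex:

Each sentence will always end with a period, an exclamation mark or a question mark.

Each sentence will always contain at least one pair of curly braces '{}' with some words in it. In addition, there will be no misleading '.' after the last brace of each sentence, so Dr. will always come before the last pair of curly braces in a sentence. This is why I have tried to base my regex on '}'. That way I can avoid the exception approach, that is, creating exceptions for things like Dr., Jr., approx. and so on. For every file I run this code on, I personally make sure that there is no "misleading period" after the last '}' in any sentence.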
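As a rough illustration of what the '}'-based rule gives when read literally, the sketch below uses a hypothetical pattern (`naive_patt`, not from the original post) that lazily reaches the first `}` and then stops at the first terminator after it. The point is that a regex cannot tell which `}` is the last one in a sentence without already knowing where the sentence ends.

```python
import re

# Hypothetical pattern for illustration: lazily reach the first '}',
# then stop at the first '.', '!' or '?' that follows it.
naive_patt = r'.*?\}[^.!?]*[.!?]\s?'

line = (
    "{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! "
    "What {will|shall|should} we {eat|have} for lunch? "
    "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
    "{that|is} what he said."
)

pieces = [p.strip() for p in re.findall(naive_patt, line)]

for p in pieces:
    print(p)
```

With this reading the line is cut into five pieces, two of which end at "Dr.", because "Dr." also follows a closing brace. The constraint as stated therefore seems to need some extra information about which periods are valid sentence ends.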


My desired output is:
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! 
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
I am the {very last|last} sentence for this {instance|example}.

The most intuitive solution I could come up with is this. Essentially, you need to treat the Dr. and Mr. tokens as atoms in themselves:
patt = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'

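Broken down, it says: find me the smallest possible number of Mr.'s, Dr.'s or any single characters, up to a punctuation mark, followed by zero or one whitespace character, followed by zero or one newline.

When used on this sample.txt (I added a line), it gives: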
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
But there are no {misters|doctors} here good sir!
Help us if there is an emergency.

newline
I am the {very last|last} sentence for this {instance|example}.
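Since the list of abbreviations is likely to grow, the same pattern can be built from a list instead of being written by hand. The following is only a sketch: the `ABBREVIATIONS` list and the `build_pattern` helper are illustrative names, not part of the answer above, and the line is inlined rather than read from sample.txt.

```python
import re

# Abbreviations to treat as single, unsplittable tokens.
# Illustrative list; extend it for the files at hand.
ABBREVIATIONS = ["Dr.", "Mr."]


def build_pattern(abbreviations):
    """Build the 'abbreviations as atoms' pattern from a list of strings."""
    # re.escape turns "Dr." into "Dr\." so the dot is matched literally.
    atoms = "|".join(re.escape(a) for a in abbreviations)
    # Lazily consume abbreviations or single characters up to the first
    # terminator, then an optional whitespace character and newline.
    return re.compile(r'(?:' + atoms + r'|.)*?[.!?]\s?\n?')


patt = build_pattern(ABBREVIATIONS)

line = (
    "{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! "
    "What {will|shall|should} we {eat|have} for lunch? "
    "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
    "{that|is} what he said.\n"
)

sentences = [s.strip() for s in patt.findall(line)]

for s in sentences:
    print(s)
```

The abbreviation alternatives are listed before the bare `.` so that they are tried first at each position; any abbreviation missing from the list will still be treated as a sentence end.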

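If you don't mind adding a dependency, the NLTK library has a sent_tokenize function which should do what you need, although I'm not entirely sure whether the curly braces will interfere.

The paper describing the method NLTK uses is over 40 pages long. Detecting sentence boundaries is not a trivial task.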
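Hi, I was really hoping to avoid using exceptions in this way, as I will be running this code on many text files, so there could be many exceptions such as Sr., approx., info., e.t.c and so on. The only thing I can be sure of for now is that there will be no sentence-ending punctuation mark after the last closing brace in each sentence, unless of course it actually is the end of the sentence.

I don't think you can have a generic one. You can say that the period after the last curly brace isn't an abbreviation, but how would the regex know the difference between }Dr. and }said.? They are both letters followed by a period. We could say that the former has a capital letter, but that rules out approx., info., e.t.c, and so on. Without declaring what a "valid" period within a sentence is, I don't think you can do what you are asking.

I see. I was hoping it was just that my regex skills weren't up to the task. OK, I guess I'll give in to the exceptions approach. Thanks for your help :-)

As both this answer and the one given by @aelfric5578 were very helpful, it wasn't easy to decide which one to accept. Essentially I accept both. However, I had to go with this one, as it showed that the regex I wanted would not work.

There is a misleading "." in "{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!". The first period also comes after the last brace of this "sentence": "{Hello there|Hello|Howdy} Dr."

In that sentence the "!" is the end of the sentence, and the "!" comes after the last "}" in that sentence. As I explained in the OP, a sentence can end with a period, an exclamation mark or a question mark.

Thanks. I know sent_tokenize exists, although I haven't tried it yet. I was hoping to get my script working on the back of the curly braces and a regex, but it looks like that isn't going to happen.

I've just tried sent_tokenize on a few files containing these curly braces and it split all the sentences accurately, so I think I'll stick with it. It doesn't preserve newlines the way I want, but I can write some code for that. Cheers.

Instead of tokenizing the whole block of text, you can feed each line to sent_tokenize separately. As long as you don't expect a sentence to extend beyond one line in the source file, that should preserve the newlines the way you want.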
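The line-by-line idea from the last comment can be sketched as follows. To keep the example runnable without extra dependencies, it uses a deliberately simple stand-in splitter (`simple_splitter`, a hypothetical helper that knows nothing about abbreviations); the function and variable names are assumptions made for illustration.

```python
import re


def simple_splitter(text):
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace.

    For illustration only; it does not handle abbreviations such as 'Dr.'.
    """
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def split_preserving_blank_lines(lines, splitter=simple_splitter):
    """Split each line on its own, keeping blank lines as '\\n' markers."""
    result = []
    for line in lines:
        if line.strip() == "":
            result.append("\n")
        else:
            result.extend(splitter(line))
    return result


# Inlined stand-in for the lines of a text file.
lines = [
    "But there are no {misters|doctors} here good sir! "
    "Help us if there is an emergency.\n",
    "\n",
    "I am the {very last|last} sentence for this {instance|example}.\n",
]

out = split_preserving_blank_lines(lines)

for item in out:
    print("newline" if item == "\n" else item)
```

With NLTK installed and its sentence tokenizer data available, `sent_tokenize` from `nltk.tokenize` could be passed as the `splitter` argument in place of the stand-in, since it also takes a string and returns a list of sentences.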