Python re.findall() capitalized words, including apostrophes


python regex python-3.x


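I'm stuck on an exercise in a regex tutorial that asks me to "find all capitalized words in my_string and print the result", where some of the words contain apostrophes.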

Original string:

In [1]: my_string
Out[1]: "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you 
find 4 sentences?  Or perhaps, all 19 words?"

Current attempt:

# Import the regex module
import re
# Find all capitalized words in my_string and print the result
capitalized_words = r"((?:[A-Z][a-z]+ ?)+)"
print(re.findall(capitalized_words, my_string))

Current result:

['Let', 'RegEx', 'Won', 'Can ', 'Or ']

I think the desired result is:

["Let's", 'RegEx', "Won't", 'Can', 'Or']

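How do you get from r"((?:[A-Z][a-z]+ ?)+)" to also selecting the 's and 't at the end of Let's and Won't, when not every word you are trying to capture is expected to contain an apostrophe?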

Just add an apostrophe to the second bracket group:

capitalized_words = r"((?:[A-Z][a-z']+)+)"
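
As a quick check of the pattern above, here is a minimal sketch run against the string from the question. The string is retyped here on one logical line, which is an assumption about how it was originally defined; the double spaces between sentences are kept.

```python
import re

# Assumption: the question's string, retyped without the wrapped line break
my_string = ("Let's write RegEx!  Won't that be fun?  I sure think so.  "
             "Can you find 4 sentences?  Or perhaps, all 19 words?")

# The apostrophe is now allowed in the part of the word after the capital letter
capitalized_words = r"((?:[A-Z][a-z']+)+)"
result = re.findall(capitalized_words, my_string)
print(result)
# ["Let's", 'RegEx', "Won't", 'Can', 'Or']
```

Note that the lone "I" is still skipped, because the `+` requires at least one character after the capital letter.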

I think you can add an apostrophe to the group

[a-z']

so that the pattern looks like

((?:[A-Z][a-z']+ ?)+)
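
One thing to watch for with this variant: because the optional space is kept inside the group, a match can end with a trailing space. A rough sketch of that behaviour, and of one way to tidy it up with `str.strip()` (the string is retyped from the question and the variable names `raw` and `cleaned` are made up for illustration):

```python
import re

# Assumption: the question's string, retyped without the wrapped line break
my_string = ("Let's write RegEx!  Won't that be fun?  I sure think so.  "
             "Can you find 4 sentences?  Or perhaps, all 19 words?")

# Same character class with the apostrophe, but the optional space is kept
pattern = r"((?:[A-Z][a-z']+ ?)+)"
raw = re.findall(pattern, my_string)
print(raw)
# ["Let's ", 'RegEx', "Won't ", 'Can ', 'Or ']

# Strip the trailing space that the optional " ?" may have consumed
cleaned = [word.strip() for word in raw]
print(cleaned)
# ["Let's", 'RegEx', "Won't", 'Can', 'Or']
```

The optional space lets a run of consecutive capitalized words come back as a single match, so whether to keep it depends on what the exercise counts as one "word".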

Hope it works.

Since you already have an answer, I'd like to offer a more "real-world" solution using nltk:

from nltk import sent_tokenize, regexp_tokenize

my_string = """Let's write RegEx!  Won't that be fun?  I sure think so.  Can you 
find 4 sentences?  Or perhaps, all 19 words?"""

sent = sent_tokenize(my_string)
print(len(sent))
# 5

# The inline flag (?i) has to come first; Python 3.11+ rejects it anywhere else in the pattern
pattern = r"(?i)\b[a-z][\w']*"
print(len(regexp_tokenize(my_string, pattern)))
# 19

To my eye that is 5 sentences, not 4, unless there is some special requirement for what counts as a sentence.
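
If installing nltk is not an option, roughly the same counts can be reproduced with the standard library alone. This is only a sketch: the sentence split below is a naive rule (break on whitespace after `.`, `!` or `?`) and is not how `sent_tokenize` works internally.

```python
import re

# Assumption: the question's string, retyped without the wrapped line break
my_string = ("Let's write RegEx!  Won't that be fun?  I sure think so.  "
             "Can you find 4 sentences?  Or perhaps, all 19 words?")

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", my_string)
print(len(sentences))
# 5

# Words start with a letter and may contain word characters or apostrophes;
# "4" and "19" are skipped because they start with a digit
words = re.findall(r"\b[a-z][\w']*", my_string, flags=re.IGNORECASE)
print(len(words))
# 19
```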

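[A-Z][a-z]+

means "one letter from A to Z, followed by one or more letters from a to z". By definition, those ranges do not include the apostrophe, so add it to the regex. The optional space (" ?") is not needed either.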
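
To see why the range cannot match an apostrophe, it may help to look at the code points involved. A small sketch (the variable names are made up for illustration):

```python
import re

# A range such as a-z covers the code points between its two endpoints
codes = (ord("'"), ord("a"), ord("z"))
print(codes)
# (39, 97, 122)

# The apostrophe lies outside a-z, so the plain class does not match it
plain = re.fullmatch(r"[a-z]", "'")
print(plain)
# None

# Listing the apostrophe explicitly inside the class makes it match
extended = re.fullmatch(r"[a-z']", "'")
print(extended is not None)
# True
```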

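The tutorial went straight from [a-z] and \w+ to asking for an expression like this, without explaining how to combine the basics, so I found the expression in my original post by searching.

Unfortunately, I think the output the question expects differs from what your answer produces, so this is not the result the tutorial is looking for.

I completely forgot about the sentence that starts with I. I'll see whether I can work out a way to capture it.

You would have to complain to whoever wrote the tutorial to get that fixed.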
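
For the sentence that starts with "I", one possible approach is to let the part after the capital letter be empty and to anchor the match at a word boundary. This is a sketch of one option, not necessarily the answer the tutorial expects:

```python
import re

# Assumption: the question's string, retyped without the wrapped line break
my_string = ("Let's write RegEx!  Won't that be fun?  I sure think so.  "
             "Can you find 4 sentences?  Or perhaps, all 19 words?")

# \b anchors at the start of a word; * (instead of +) lets a single capital
# letter such as "I" match; A-Z inside the class keeps "RegEx" in one piece
pattern = r"\b[A-Z][A-Za-z']*"
capitalized = re.findall(pattern, my_string)
print(capitalized)
# ["Let's", 'RegEx', "Won't", 'I', 'Can', 'Or']
```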