Repeated capture group gives strange results in Python

python regex

I want to match repeated occurrences of natural numbers and capture all of them.

import re
r = "the ((sixty|six)[ -]+)+items"
s = "the sixty six items"
re.findall(r, s)
# [('six ', 'six')]

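It matched 'six' twice, although it can be observed that it could never have matched on "six six"; instead it must have matched on "sixty six", yet the captures return ('six ', 'six').

What is happening here, and how can I get ('sixty', 'six') back?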

If you use `(group)+`, the group captures only the text of its last repetition.
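A minimal sketch of that behavior, using a toy pattern and input that are illustrations rather than part of the question: each repetition overwrites what the group captured on the previous one, so only the final iteration is left.

```python
import re

# Toy pattern (an assumption for illustration): one capture group, repeated.
m = re.match(r"(?:(\w+) )+end", "a b c end")

# The whole match spans every repetition...
whole = m.group(0)
# ...but the group holds only the text from its last iteration.
last = m.group(1)

print(whole)  # a b c end
print(last)   # c
```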

You should use `findall` with a slightly different regex:

s = 'the sixty six items'

>>> if re.match(r'the (?:(?:sixty|six)[ -]+)+items', s):
...     re.findall(r"\b(sixty|six)[ -]+(?=.*\bitems\b)", s)
...
['sixty', 'six']

Your question contains this code:

>>> r = "the ((sixty|six)[ -]+)+items"
>>> s = "the sixty six items"
>>> re.findall(r, s)

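This returns `[('six ', 'six')]` because a quantifier (`+`) follows your group, i.e. `((sixty|six)[ -]+)+`.

`findall` returns two items:

Captured group #1 is `'six '` (note the trailing space, which comes from `[ -]+` in the first group).

Captured group #2 is `'six'` (the inner group, i.e. `(sixty|six)`).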
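As a rough check of the explanation above, the sketch below inspects the match object for the question's pattern. The variable names `full`, `outer`, and `inner` are only illustrative. The full match does cover "sixty six", while both groups keep just their last repetition.

```python
import re

r = r"the ((sixty|six)[ -]+)+items"
s = "the sixty six items"

m = re.search(r, s)
full = m.group(0)   # the entire match
outer = m.group(1)  # outer group, last repetition, with the trailing space
inner = m.group(2)  # inner group, last repetition

print(repr(full))   # 'the sixty six items'
print(repr(outer))  # 'six '
print(repr(inner))  # 'six'

# findall reports only the groups, as one tuple per match.
pairs = re.findall(r, s)
print(pairs)        # [('six ', 'six')]
```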

Try this regex:

re.findall(r'(six\w*)', s)

Use `\b` assertions. Hope this helps:

>>> s = "the sixty six items"
>>> print(re.findall(r'(?is)(\bsixty\b|\bsix\b)',s))
['sixty', 'six']

`\b` assertions avoid false hits, for example if you add "sixteen" but do not want it to match.

Without `\b`:

>>> s = "the sixty sixteen six items"
>>> print(re.findall(r'(?is)(sixty|six)',s))
['sixty', 'six', 'six']

With `\b` (the advantage), "sixteen" no longer produces a false hit.
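The word-boundary variant for the same input might look like the sketch below. It reuses the answer's alternation but puts it in a non-capturing group; the name `strict` is only illustrative.

```python
import re

s = "the sixty sixteen six items"

# \b on both sides means only whole words match,
# so the "six" inside "sixteen" is skipped.
strict = re.findall(r"\b(?:sixty|six)\b", s)

print(strict)  # ['sixty', 'six']
```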

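`re.search` finds only the first match for the pattern; once it has found one, it does not look for further matches. You get `('six ', 'six')` because one capture group is nested inside another: `'six '` is what the outer group matched, and `'six'` (without the trailing space) is what the inner group matched.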

You can do what you want with two un-nested capture groups placed inside non-capturing groups, which use the `(?:...)` syntax.

import re

r = "the (?:(?:(sixty)|(six))[ -]+)+items"
s = "the sixty six items"
m = re.search(r, s)
if m:
    print(m.groups())

Output

('sixty', 'six')

This returns a tuple of two items because the pattern has two capture groups.

Here is a longer demo:

import re

pat = re.compile("the (?:(?:(sixty)|(six))[ -]+)+items")

data = (
    "the items",
    "the six items",
    "the six six items",
    "the sixty items",
    "the six sixty items",
    "the sixty six items",
    "the sixty-six items",
    "the six sixty sixty items",
)

for s in data:
    m = pat.search(s)
    print('{!r} -> {}'.format(s, m.groups() if m else None))

Output

'the items' -> None
'the six items' -> (None, 'six')
'the six six items' -> (None, 'six')
'the sixty items' -> ('sixty', None)
'the six sixty items' -> ('sixty', 'six')
'the sixty six items' -> ('sixty', 'six')
'the sixty-six items' -> ('sixty', 'six')
'the six sixty sixty items' -> ('sixty', 'six')

"It could never have matched on 'six six'."

...Yes, it could; I may be wrong, but `(sixty|six)` means match `sixty` or `six`.

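@TimBiegeleisen, because if it had to match them consecutively, where would the "ty" be matched in the sixty-six case?

Well, in the actual example this is only one part of it.

`findall` also returns `('six ', 'six')`, but group 0 cannot possibly be 'six ', so it seems something strange is happening?

The problem is that I want to run `re.findall` at a higher level. For example, I want to match multiple times, and the problem is that `re.findall(r"(sixty|six)[ -]+(?=.*\bitems)", s+s+s)` returns a flat list, so we lose the grouping by each higher-level findall match.

I am considering this solution, thank you very much for your help! It is just that, apart from the "actual problem", I find this behavior inexplicable!

I have tried to explain the issue with the actual problem in the updated answer.

This may give wrong answers; for example, I do not want a hit on "sixteen".

How does this scale when we also include `seven` and `seventeen`? Somehow I get a lot of `None` matches in between, using `r = "the (?:(?:(sixty)|(six)|(seven)|(seventeen))[ -]+)+items"`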

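@PascalvKooten: That pattern has 4 capture groups, so when you search with it, `.groups()` returns a tuple of 4 elements, one for each capture group in the pattern.

So I think I will group the "-teens", the "-ties", and those without a suffix, and then have 3 possible groups there... Thanks for the explanation as well.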
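One possible way to keep the per-phrase grouping discussed in these comments, without adding a capture group for every number word, is a two-stage match: find each whole phrase first, then extract the words inside it. This is only a sketch under assumptions: the word list, the sample text, and the helper name `numbers_per_phrase` are hypothetical and do not come from the thread.

```python
import re

# Hypothetical word list; longer words are listed before their prefixes.
WORDS = ["seventeen", "sixty", "seven", "six"]
ALT = "|".join(WORDS)

# Outer pattern: one whole phrase, with the run of number words in group 1.
PHRASE = re.compile(rf"the ((?:(?:{ALT})[ -]+)+)items")
# Inner pattern: a single number word, matched as a whole word.
WORD = re.compile(rf"\b(?:{ALT})\b")

def numbers_per_phrase(text):
    """Return one list of number words per matched phrase."""
    return [WORD.findall(m.group(1)) for m in PHRASE.finditer(text)]

text = "the sixty six items, the seventeen items, the six seven items"
result = numbers_per_phrase(text)
print(result)
# [['sixty', 'six'], ['seventeen'], ['six', 'seven']]
```

Because the inner `findall` has no capture groups, it returns plain strings in order of appearance, so no `None` placeholders show up and repeated words are all kept.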