Python regex to match 2 different delimiters

python regex


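I'm trying to create a regex that will match the following:

[[uid::page name|page alias]]

For example:

[[nw::Home|Home page]]

Both uid and page alias are optional.

I want the delimiters :: and | to be allowed only once each, and only in the order shown. However, a single : should be allowed anywhere after the uid. Herein lies the problem.

The following regex works pretty well, except that it also matches strings where :: appears twice or appears in the wrong place: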

import re

regex = r'\[\[([\w]+::)?([^|\t\n\r\f\v]+)(\|[^|\t\n\r\f\v]+)?\]\]'
re.match(regex, '[[Home]]') # matches, good
re.match(regex, '[[Home|Home page]]') # matches, good
re.match(regex, '[[nw::Home]]') # matches, good
re.match(regex, '[[nw::Home|Home page]]') # matches, good
re.match(regex, '[[nw|Home|Home page]]') # doesn't match, good
re.match(regex, '[[nw|Home::Home page]]') # matches, bad
re.match(regex, '[[nw::Home::Home page]]') # matches, bad

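I've read all about negative lookahead and lookbehind expressions, but I can't figure out how to apply them in this case. Any suggestions would be appreciated.

Edit: I'd also like to know how to keep the delimiters out of the match results, which currently look like this:

('nw::', 'Home', '|Home page')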

If I understand your requirements correctly, you could use this:

\[\[(?:(?<uid>\w+)::)?(?!.*::)(?<page>[^|\t\n\r\f\v]+)(?:\|(?<alias>[^|\t\n\r\f\v]+))?\]\]
                      ^^^^^^^^
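
As a minimal sketch of how the marked lookahead behaves in Python: the version below writes the same pattern with Python's (?P<name>...) named-group syntax, because the re module does not accept (?<name>...). The helper name parse_link and the use of re.match are assumptions made for illustration, not part of the answer above.

```python
import re

# Same pattern as above, written with Python's (?P<name>...) named groups.
# (?!.*::) rejects the link if another "::" appears anywhere after the uid.
LINK_RE = re.compile(
    r'\[\[(?:(?P<uid>\w+)::)?(?!.*::)'
    r'(?P<page>[^|\t\n\r\f\v]+)'
    r'(?:\|(?P<alias>[^|\t\n\r\f\v]+))?\]\]'
)

def parse_link(text):
    """Return (uid, page, alias), or None if the link is rejected.

    parse_link is a hypothetical helper name used only in this sketch.
    """
    m = LINK_RE.match(text)
    return m.group('uid', 'page', 'alias') if m else None

for s in ('[[nw::Home|Home page]]', '[[nw::Home:Home page]]',
          '[[nw|Home::Home page]]', '[[nw::Home::Home page]]'):
    print(s, '->', parse_link(s))
```

Because the delimiters sit outside the named groups, the returned tuple holds only the uid, page name, and alias. The lookahead scans to the end of the line, so this sketch assumes one link per line.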

How about this:

import re

regex = r'''
    \[\[                            # opening [[
        ([\w ]+)                    # first word (with possible spaces)
        (?:
            ::                      # the two colons
            (                       # second word (with possible spaces and single colons)
                [\w ]+              # word characters and spaces
                (?:
                    :               # a colon
                    [\w ]+          # word characters and spaces
                )*                  # optional, may repeat any number of times
            )
        )?                          # not required
        (?:
            \|                      # a pipe
            ([\w ]+)                # third word (with possible spaces)
        )?
    \]\]                            # closing ]]
'''

test_strings = (
    '[[Home]]',
    '[[Home|Home page]]',
    '[[nw::Home]]',
    '[[nw::Home|Home page]]',
    '[[nw|Home|Home page]]',
    '[[nw|Home::Home page]]',
    '[[nw::Home::Home page]]',
    '[[nw::Home:Home page]]',
    '[[nw::Home:Home page|Home page]]'
)

for test_string in test_strings:
    print(re.findall(regex, test_string, re.X))

Output:

[('Home', '', '')]
[('Home', '', 'Home page')]
[('nw', 'Home', '')]
[('nw', 'Home', 'Home page')]
[]
[]
[]
[('nw', 'Home:Home page', '')]
[('nw', 'Home:Home page', 'Home page')]

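It doesn't use lookaheads or lookbehinds. It allows single colons in the string after the first :: (as shown by the last two test strings). The short version of the regex is: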

\[\[([\w ]+)(?:::([\w ]+(?::[\w ]+)*))?(?:\|([\w ]+))?\]\]

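The only catch is that you have to check whether the second group is empty. If it is, there was no double colon (::), and the first group, which normally holds the string before the colons, is actually the page name.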
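
The check described above could be sketched roughly as follows. The helper name split_link is hypothetical, and returning a (uid, page, alias) tuple with None for missing parts is an assumption about how the result would be consumed.

```python
import re

# The short regex from this answer, unchanged.
LINK_RE = re.compile(r'\[\[([\w ]+)(?:::([\w ]+(?::[\w ]+)*))?(?:\|([\w ]+))?\]\]')

def split_link(text):
    """Return (uid, page, alias), or None if text is not a valid link.

    split_link is a hypothetical helper name used only in this sketch.
    """
    m = LINK_RE.match(text)
    if m is None:
        return None
    first, second, alias = m.groups()
    if second is None:
        # No "::" was present, so the first group is the page name, not a uid.
        return (None, first, alias)
    return (first, second, alias)

print(split_link('[[Home|Home page]]'))
print(split_link('[[nw::Home:Home page|Home page]]'))
```

With re.match, a group that did not take part in the match is None rather than the empty string that re.findall reports, which is why the sketch tests for None.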

Does this work?

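Can you elaborate on what you mean by "However, the character : should be allowed anywhere after the uid"? None of the matching or non-matching examples you gave seems to contain an unusual character sequence. This is somewhat similar to the problem of writing a correct regex for C comments:

/***/

It can be done, but it's tricky. Look up "C comment regex" for ideas.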
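
For readers who want to see the analogy, here is a rough sketch of one commonly cited pattern for a single C block comment. The names C_COMMENT_RE and first_comment are assumptions for illustration, and the sketch ignores comment markers that appear inside string literals.

```python
import re

# Matches one /* ... */ comment and stops at the first closing "*/".
# After each run of "*", the next character must not be "/" or "*" for the
# comment body to continue; otherwise the final "/" closes the comment.
C_COMMENT_RE = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')

def first_comment(source):
    """Return the first block comment found in source, or None."""
    m = C_COMMENT_RE.search(source)
    return m.group(0) if m else None

print(first_comment('/***/'))
print(first_comment('a /* x */ b /* y */'))
```

As with the :: delimiter in the question, the difficulty is allowing a single * inside the body while refusing to run past the two-character terminator.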

?P<name> instead of ?<name>

@falsetru: Thanks! I forgot that Python has a slightly different syntax for named capture groups. It goes like this:

\[\[(?:(?P<uid>\w+)::)?(?!.*::)(?P<page>[^|\t\n\r\f\v]+)(?:\|(?P<alias>[^|\t\n\r\f\v]+))?\]\]

@nw. Awesome! I was afraid I had missed something!


import re

regex = r'\[\[(([\w]+)::)?([^|\t\n\r\f\v]+)(\|([^|\t\n\r\f\v]+))?\]\]'

def show(text):
    # Print groups 2, 3, 5 (uid, page, alias without delimiters), or None if there is no match.
    m = re.match(regex, text)
    print(m.group(2, 3, 5) if m else None)

show('[[Home]]')  # matches, good
show('[[Home|Home page]]')  # matches, good
show('[[nw::Home]]')  # matches, good
show('[[nw::Home|Home page]]')  # matches, good
show('[[nw|Home|Home page]]')  # doesn't match, good
show('[[nw|Home::Home page]]')  # matches, bad
show('[[nw::Home::Home page]]')  # matches, bad