What is the Python way to do a \G anchored parsing loop?

python regex

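Below is a Perl function I wrote some years ago. It is a smart tokenizer that recognizes some cases where things are stuck together that probably shouldn't be. For example, given the input on the left, it splits the string as shown on the right:

I am now doing some machine learning experiments, and I'd like to run some of them with this tokenizer. But first I need to port it from Perl to Python. The key to this code is the loop that uses the \G anchor, which I hear does not exist in Python. I tried googling for how this is done in Python, but I'm not sure exactly what to search for, so I'm having trouble finding an answer.

How would you write this function in Python?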

sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters, 
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
    my $value = shift;
    my @list = ();
    my $word;

    while ( $value ne '' && $value =~ m/
        \G                # start where previous left off
        ([^a-zA-Z0-9]*)   # capture non-alpha-numeric characters, if any
        ([a-zA-Z0-9]*?)   # capture everything up to a token boundary
        (?:               # identify the token boundary
            (?=[^a-zA-Z0-9])       # next character is not a word character 
        |   (?=[A-Z][a-z])         # Next two characters are upper lower
        |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
        |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
                # ordinal boundaries
        |   (?<=^1(?i:st))         # first
        |   (?<=[^1][1](?i:st))    # first but not 11th
        |   (?<=^2(?i:nd))         # second
        |   (?<=[^1]2(?i:nd))      # second but not 12th
        |   (?<=^3(?i:rd))         # third
        |   (?<=[^1]3(?i:rd))      # third but not 13th
        |   (?<=1[123](?i:th))     # 11th - 13th
        |   (?<=[04-9](?i:th))     # other ordinals
                # non-ordinal digit-letter boundaries
        |   (?<=^1)(?=[a-zA-Z])(?!(?i)st)       # digit-letter but not first
        |   (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)    # digit-letter but not 11th
        |   (?<=^2)(?=[a-zA-Z])(?!(?i)nd)       # digit-letter but not second
        |   (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)    # digit-letter but not 12th
        |   (?<=^3)(?=[a-zA-Z])(?!(?i)rd)       # digit-letter but not third
        |   (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)    # digit-letter but not 13th
        |   (?<=1[123])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not 11th - 13th
        |   (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not ordinal
        |   (?=$)                               # end of string
        )
    /xg )
    {
        push @list, $1 if $1 ne '';
        push @list, $2 if $2 ne '';
    }
    return @list;
}

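Use re.RegexObject.match

You can emulate the effect of \G at the beginning of a regex with the re module by keeping track of the current position yourself and passing it to match as the pos argument, which forces the match to begin at exactly that position.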
def tokenize(w):
    index = 0
    m = matcher.match(w, index)
    o = []
    # The index != m.end() test does detect a zero-length match, but it is
    # mainly a guard against an accidental infinite loop.
    # Don't expect a regex that can match the empty string to work here.
    # See the Caveat section.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return o
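
To see the position-tracking idea in isolation, here is a minimal sketch with a deliberately simple pattern. The names TOKEN and scan and the three-way token rule are my own assumptions for illustration, not the tokenizer from the question.

```python
import re

# Hypothetical toy rule: a run of letters, a run of digits, or a run of anything else.
TOKEN = re.compile(r'[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+')

def scan(text):
    """Collect consecutive tokens, each one anchored where the previous one ended."""
    pos = 0
    tokens = []
    while pos < len(text):
        # match(text, pos) only succeeds if a token starts exactly at pos,
        # which is the role \G plays in the Perl loop.
        m = TOKEN.match(text, pos)
        if m is None or m.end() == pos:
            raise ValueError('no token at position %d' % pos)
        tokens.append(m.group())
        pos = m.end()
    return tokens

print(scan('abc123, def'))
```

Passing pos differs from slicing the string first: with pos, lookbehinds can still see the characters before the start position, and ^ still refers to the real start of the string.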

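Caveat

One caveat of this method is that it does not work well with a regex whose main match can be an empty string, because Python has no facility to make the regex retry a match at the same position while preventing a zero-length match.

For example, re.findall(r'(.??)', 'abc') returns a list of 4 empty strings ['', '', '', ''], whereas in PCRE you can find 7 matches,

where the 2nd, 4th and 6th matches start at the same indices as the 1st, 3rd and 5th matches respectively. The additional matches in PCRE are found by retrying at the same index with a flag that prevents an empty-string match. (That findall result is from the Python of the time; since Python 3.7, findall lets a non-empty match start right after an empty one, so newer versions may return more matches.)

I know the question is about Perl, not PCRE, but the global matching behavior should be the same; otherwise, the original code could not have worked.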
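
To make the caveat concrete, here is a small sketch of what happens when the anchored loop meets a pattern that can match the empty string. The pattern and the helper name anchored_tokens are hypothetical, chosen only to show why the zero-length guard is there.

```python
import re

# Hypothetical pattern that is allowed to match the empty string.
digits = re.compile(r'[0-9]*')

def anchored_tokens(pattern, text):
    """Run the pos-tracking loop and report where it stopped."""
    pos, out = 0, []
    while pos < len(text):
        m = pattern.match(text, pos)
        # No match, or a zero-length match: stop instead of spinning forever,
        # since there is no way to ask for a non-empty retry at the same index.
        if m is None or m.end() == pos:
            break
        out.append(m.group())
        pos = m.end()
    return out, pos

print(anchored_tokens(digits, '12ab'))
```

The loop takes '12' and then stops at index 2, leaving 'ab' unprocessed, because the next attempt succeeds with an empty match.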

As stated in the question, rewriting

([^a-zA-Z0-9]*)([a-zA-Z0-9]*?)

as

(.+?)

avoids this problem, though you may want to use a flag with it.
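
As a rough illustration of why (.+?) sidesteps the problem: it has to consume at least one character, so every successful match moves the position forward. The sketch below uses a much smaller boundary rule than the real tokenizer (lower-to-upper transitions and end of string only), and the re.S flag is my own assumption, added so that . also matches newlines.

```python
import re

# Simplified, hypothetical boundary rule: split between a lowercase and an
# uppercase letter, or at the end of the string.
BOUNDARY = re.compile(r'(.+?)(?:(?<=[a-z])(?=[A-Z])|(?=$))', re.S)

def split_camel(text):
    pos, out = 0, []
    while pos < len(text):
        m = BOUNDARY.match(text, pos)
        if m is None:
            break
        out.append(m.group(1))
        pos = m.end()          # always > pos, because .+? consumed something
    return out

print(split_camel('fooBarBaz'))
```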

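Other comments on the regex

Since the case-insensitive flag in Python (at least when this was written) affects the whole pattern, the case-insensitive sub-patterns have to be rewritten. I would rewrite

(?i:st)

as

[sS][tT]

to preserve the original meaning, but go with

(?:st|ST)

if that is what your requirement calls for.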
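
For what it's worth, as far as I know Python 3.6 and later accept scoped inline flags such as (?i:st), so on a recent interpreter the rewrite may not be needed. The following sketch compares the two spellings on a tiny hypothetical pattern; it assumes Python 3.6+ and is not taken from the tokenizer itself.

```python
import re

scoped = re.compile(r'1(?i:st)')         # case-insensitivity applies to "st" only
spelled_out = re.compile(r'1[sS][tT]')   # the portable rewrite

samples = ['1st', '1ST', '1St', '1sT']
for s in samples:
    print(s, bool(scoped.fullmatch(s)), bool(spelled_out.fullmatch(s)))

# The flag is local to the group: the leading "a" below stays case-sensitive.
print(re.fullmatch(r'a(?i:st)', 'AST'))
```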

Since Python supports verbose patterns through the re.X flag, you can write your regex much like you did in the Perl code:

import re

matcher = re.compile(r'''
    (.+?)
    (?:               # identify the token boundary
        (?=[^a-zA-Z0-9])       # next character is not a word character 
    |   (?=[A-Z][a-z])         # Next two characters are upper lower
    |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
            # ordinal boundaries
    |   (?<=^1[sS][tT])         # first
    |   (?<=[^1][1][sS][tT])    # first but not 11th
    |   (?<=^2[nN][dD])         # second
    |   (?<=[^1]2[nN][dD])      # second but not 12th
    |   (?<=^3[rR][dD])         # third
    |   (?<=[^1]3[rR][dD])      # third but not 13th
    |   (?<=1[123][tT][hH])     # 11th - 13th
    |   (?<=[04-9][tT][hH])     # other ordinals
            # non-ordinal digit-letter boundaries
    |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not first
    |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])    # digit-letter but not 11th
    |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not second
    |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])    # digit-letter but not 12th
    |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not third
    |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])    # digit-letter but not 13th
    |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not 11th - 13th
    |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not ordinal
    |   (?=$)                               # end of string
    )
''', re.X)

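For Python, I think the \G anchor is available in the experimental regex module.

FYI - many of the regex subexpressions can be combined into a single expression.

What do you mean, they can be combined? Can you give me a concrete example?

Let me test it. Right off, there is a problem. This ([a-zA-Z0-9]*?) will match nothing if it can. Since some of your token-boundary assertions hold regardless of what it has matched, the engine advances the search position by 1 character without consuming anything. This keeps happening until no assertion other than (?=$) is satisfied, which forces ([a-zA-Z0-9]*?) to match the remaining letters/digits.

This code works in Perl. I'm not asking for help with the regex, just for help with how to implement this idiomatic programming style in Python.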
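
The point about the lazy group matching nothing can be checked with a much smaller pattern. This is only a sketch with made-up patterns (lazy_star and lazy_plus are hypothetical names), showing how a lazy * behaves in Python when a boundary assertion already holds at the start position, compared with a lazy +.

```python
import re

# Lazy star: tries the empty match first, and the lookahead already succeeds.
lazy_star = re.compile(r'([a-z]*?)(?=[a-z]|$)')
# Lazy plus: must consume at least one character before testing the boundary.
lazy_plus = re.compile(r'([a-z]+?)(?=[a-z]|$)')

m_star = lazy_star.match('abc')
m_plus = lazy_plus.match('abc')

print(repr(m_star.group(1)), m_star.end())
print(repr(m_plus.group(1)), m_plus.end())
```

With the star version, a pos-tracking loop would never advance; with the plus version each match moves forward by at least one character.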