How do you write a \G-anchored parsing loop in Python?
Below is a Perl function I wrote a few years ago. It is a smart tokenizer that recognizes some cases where things are stuck together that probably should not be. For example, given the input on the left, it splits the string as shown on the right:
I am now doing some machine learning experiments, and I would like to run some of them with this tokenizer. But first I need to port it from Perl to Python. The key to this code is the loop that uses the \G anchor, which I have heard does not exist in Python. I tried googling how this is done in Python, but I am not sure exactly what to search for, so I am having trouble finding an answer.
How would you write this function in Python?
sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters,
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
    my $value = shift;
    my @list = ();
    my $word;
    while ( $value ne '' && $value =~ m/
        \G                                        # start where previous left off
        ([^a-zA-Z0-9]*)                           # capture non-alpha-numeric characters, if any
        ([a-zA-Z0-9]*?)                           # capture everything up to a token boundary
        (?:                                       # identify the token boundary
            (?=[^a-zA-Z0-9])                      # next character is not a word character
          | (?=[A-Z][a-z])                        # Next two characters are upper lower
          | (?<=[a-z])(?=[A-Z])                   # lower followed by upper
          | (?<=[a-zA-Z])(?=[0-9])                # letter followed by digit
          # ordinal boundaries
          | (?<=^1(?i:st))                        # first
          | (?<=[^1][1](?i:st))                   # first but not 11th
          | (?<=^2(?i:nd))                        # second
          | (?<=[^1]2(?i:nd))                     # second but not 12th
          | (?<=^3(?i:rd))                        # third
          | (?<=[^1]3(?i:rd))                     # third but not 13th
          | (?<=1[123](?i:th))                    # 11th - 13th
          | (?<=[04-9](?i:th))                    # other ordinals
          # non-ordinal digit-letter boundaries
          | (?<=^1)(?=[a-zA-Z])(?!(?i)st)         # digit-letter but not first
          | (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)      # digit-letter but not 11th
          | (?<=^2)(?=[a-zA-Z])(?!(?i)nd)         # digit-letter but not first
          | (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)      # digit-letter but not 12th
          | (?<=^3)(?=[a-zA-Z])(?!(?i)rd)         # digit-letter but not first
          | (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)      # digit-letter but not 13th
          | (?<=1[123])(?=[a-zA-Z])(?!(?i)th)     # digit-letter but not 11th - 13th
          | (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)     # digit-letter but not ordinal
          | (?=$)                                 # end of string
        )
    /xg )
    {
        push @list, $1 if $1 ne '';
        push @list, $2 if $2 ne '';
    }
    return @list;
}
Use re.RegexObject.match (re.Pattern.match in Python 3)
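With the re module you can emulate the effect of a \G at the start of the regex by keeping track of the start position yourself and passing it to match, which forces the match to begin exactly at the position given in pos.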
def tokenize(w):
    # matcher is the compiled pattern defined further down.
    index = 0
    m = matcher.match(w, index)
    o = []
    # Although index != m.end() checks for a zero-length match, it's more of
    # a guard against an accidental infinite loop.
    # Don't expect a regex which can match the empty string to work.
    # See the Caveat section.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return o
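To see the position-tracking idea in isolation, here is a minimal sketch that uses a deliberately simple pattern rather than the tokenizer's regex. The names simple_token and scan are made up for this illustration; the only library feature relied on is the optional pos argument of a compiled pattern's match method.

```python
import re

# Hypothetical, deliberately simple pattern: a run of letters, a run of
# digits, or a run of anything else. It can never match the empty string.
simple_token = re.compile(r'[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+')


def scan(text):
    """Collect tokens by matching repeatedly at the previous end position."""
    tokens = []
    pos = 0
    while pos < len(text):
        # match(text, pos) only succeeds if the match starts exactly at pos,
        # which is the role \G plays in the Perl loop.
        m = simple_token.match(text, pos)
        if m is None:
            raise ValueError('no token at position %d' % pos)
        tokens.append(m.group())
        pos = m.end()
    return tokens


print(scan('abc123, def'))  # ['abc', '123', ', ', 'def']
```

Because every alternative consumes at least one character, pos always advances and the loop terminates without needing a zero-length guard.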
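Caveat
One caveat of this approach is that it does not cope well with a regex whose main match can be the empty string, because the loop above has no way to make the engine retry at the same position while forbidding a zero-length match.
For example, on Python versions before 3.7, re.findall(r'(.*?)', 'abc') returns a list of 4 empty strings, ['', '', '', ''], whereas PCRE finds 7 matches: ['', 'a', '', 'b', '', 'c', '']. The 2nd, 4th and 6th matches start at the same indices as the 1st, 3rd and 5th respectively. PCRE finds these extra matches by retrying at the same index with a flag that prevents an empty-string match. (Since Python 3.7, findall and finditer also let a non-empty match start right after an empty one, but a hand-written match(w, index) loop still has to deal with this itself.)
I know the question is about Perl rather than PCRE, but the global matching behaviour should be the same; otherwise the original code could not have worked.
Rewriting the ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) from the question as (.+?) avoids this problem, although you may want to use the re.S flag so that . also matches newlines.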
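If a pattern that can match the empty string has to be kept, one possible workaround (not part of the answer above, and only a sketch) is to detect the zero-length match in the loop and step over one character by hand. The names letters and scan_skipping_empty are hypothetical, and note that this variant silently drops the characters it steps over.

```python
import re

# Hypothetical pattern that is allowed to match the empty string.
letters = re.compile(r'[a-z]*')


def scan_skipping_empty(text):
    """Anchored scan that survives zero-length matches by advancing manually."""
    found = []
    pos = 0
    while pos < len(text):
        m = letters.match(text, pos)
        if m is None or m.end() == pos:
            # No match, or an empty one: move past one character ourselves,
            # since match() will not retry at pos with empty matches forbidden.
            pos += 1
        else:
            found.append(m.group())
            pos = m.end()
    return found


print(scan_skipping_empty('ab12cd'))  # ['ab', 'cd']
```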
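Other comments on the regex
Since the case-insensitive flag in Python applies to the whole pattern, the case-insensitive sub-patterns have to be rewritten. I would rewrite (?i:st) as [sS][tT] to preserve the original meaning, but go with (?:st|ST) if that is what your requirements call for. (This restriction concerns the global (?i) flag; Python 3.6 and later also accept the scoped form (?i:st).)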
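The difference between these spellings can be checked with a small sketch. The pattern names manual, strict and scoped are made up for this illustration, and the scoped form assumes Python 3.6 or later.

```python
import re

# Three spellings of "the position right after 1st".
manual = re.compile(r'(?<=1[sS][tT])')   # any mix of upper and lower case
strict = re.compile(r'(?<=1(?:st|ST))')  # all lower case or all upper case only
scoped = re.compile(r'(?<=1(?i:st))')    # scoped flag, assumes Python 3.6+

for text in ('1st', '1ST', '1St', '2nd'):
    print(text,
          bool(manual.search(text)),
          bool(strict.search(text)),
          bool(scoped.search(text)))
```

Only the mixed-case input 1St separates the variants: manual and scoped accept it, strict does not.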
Since Python supports verbose mode (re.X), you can write the regex much like the Perl code:
import re

matcher = re.compile(r'''
    (.+?)
    (?:                                          # identify the token boundary
        (?=[^a-zA-Z0-9])                         # next character is not a word character
      | (?=[A-Z][a-z])                           # next two characters are upper lower
      | (?<=[a-z])(?=[A-Z])                      # lower followed by upper
      | (?<=[a-zA-Z])(?=[0-9])                   # letter followed by digit
      # ordinal boundaries
      | (?<=^1[sS][tT])                          # first
      | (?<=[^1][1][sS][tT])                     # first but not 11th
      | (?<=^2[nN][dD])                          # second
      | (?<=[^1]2[nN][dD])                       # second but not 12th
      | (?<=^3[rR][dD])                          # third
      | (?<=[^1]3[rR][dD])                       # third but not 13th
      | (?<=1[123][tT][hH])                      # 11th - 13th
      | (?<=[04-9][tT][hH])                      # other ordinals
      # non-ordinal digit-letter boundaries
      | (?<=^1)(?=[a-zA-Z])(?![sS][tT])          # digit-letter but not first
      | (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not 11th
      | (?<=^2)(?=[a-zA-Z])(?![nN][dD])          # digit-letter but not second
      | (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not 12th
      | (?<=^3)(?=[a-zA-Z])(?![rR][dD])          # digit-letter but not third
      | (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not 13th
      | (?<=1[123])(?=[a-zA-Z])(?![tT][hH])      # digit-letter but not 11th - 13th
      | (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])      # digit-letter but not ordinal
      | (?=$)                                    # end of string
    )
''', re.X)
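For Python, I think the \G anchor is available in the experimental regex module.
FYI: many of the regex sub-expressions can be combined into one expression.
What do you mean, they can be combined? Can you give me a concrete example?
Let me test it. Right off the bat, there is a problem. The ([a-zA-Z0-9]*?) will match nothing if it can. Since some of your token-boundary assertions do not depend on what it matched, the engine will bump the search position along by one character without consuming anything. This happens as often as it can, until no assertion other than (?=$) is satisfied, which forces ([a-zA-Z0-9]*?) to match the remaining letters/digits.
This code works in Perl. I am not asking for help with the regex, only for help with how to implement this programming idiom in Python.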