Python: replacing adjacent identical tokens that match a regular expression


In a Python application I need to replace adjacent, identical, whitespace-separated tokens that match a regular expression, e.g. for a pattern like "a\w\w".

EDIT

My example above did not make it clear that tokens which do not match the regex should not be aggregated. A better example is

"xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc" 
--> "xyz xyz abc*2 zzq ak9*3 foo foo bar abc"
END EDIT

I have working code posted below, but it seems more complicated than it ought to be.

I'm not looking for a round of code golf, but I am interested in a more readable solution, using the standard Python libraries, with comparable performance.

In my application it is safe to assume that input strings are less than 10,000 characters long and that any given string contains only a handful (say, fewer than 10) of distinct strings matching the pattern.

import re

def fm_pattern_factory(ptnstring):
    """
    Return a regex that matches two or more occurrences 
    of ptnstring separated by whitespace.
    >>> fm_pattern_factory('abc').match(' abc abc ') is None
    False
    >>> fm_pattern_factory('abc').match('abc') is None
    True
    """
    ptn = r"\s*({}(?:\s+{})+)\s*".format(ptnstring, ptnstring)
    return re.compile(ptn)

def fm_gather(target, ptnstring):
    """
    Replace adjacent occurrences of ptnstring in target with
    ptnstring*N where N is the number of occurrences.
    >>> fm_gather('xyz abc abc def abc', 'abc')
    'xyz abc*2 def abc'
    >>> fm_gather('xyz abc abc def abc abc abc qrs', 'abc')
    'xyz abc*2 def abc*3 qrs'
    """
    ptn = fm_pattern_factory(ptnstring)
    result = []
    index = 0
    for match in ptn.finditer(target):
        result.append(target[index:match.start()+1])
        repl = "{}*{}".format(ptnstring, match.group(1).count(ptnstring))
        result.append(repl)
        index = match.end() - 1

    result.append(target[index:])
    return "".join(result)

def fm_gather_all(target, ptn):
    """ 
    Apply fm_gather() to all distinct matches for ptn.
    >>> s = "x abc abc y abx abx z acq"
    >>> ptn = re.compile(r"a..")
    >>> fm_gather_all(s, ptn)
    'x abc*2 y abx*2 z acq'
    """
    ptns = set(ptn.findall(target))
    for p in ptns:
        target = fm_gather(target, p)
    return "".join(target)

Sorry, I was putting an answer together before I saw your first comment. If this doesn't answer your question, let me know and I'll delete it or try to adapt it accordingly.

For the simple input provided in the question (what is stored in the my_string variable in the code below) you can try a different approach: walk through the input list and keep a 'bucket' for each run of identical tokens.
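
A minimal sketch of that bucket approach (my reconstruction; the names my_string and collected are assumptions, and the prints are modeled on the transcript below):

my_string = "xyz abc abc zzq ak9 ak9 ak9 foo abc"

my_splitted_string = my_string.split()
print("my_splitted_string is a %s now containing: %s"
      % (type(my_splitted_string), my_splitted_string))

collected = []  # list of [token, count] "buckets"
for token in my_splitted_string:
    if collected:
        print("Does %s match %s?" % (token, collected[-1][0]))
    if collected and collected[-1][0] == token:
        print("It does. Aggregating")
        collected[-1][1] += 1
    else:
        if collected:
            print("It doesn't. Creating a new 'bucket'")
        collected.append([token, 1])

print("Collected occurrences: %s" % collected)

# Rebuild the string, appending "*N" only when a token was repeated.
compressed = ""
for token, count in collected:
    compressed += token if count == 1 else "%s*%s" % (token, count)
    compressed += " "
print("Compressed string: %r" % compressed)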

This produces:

my_splitted_string is a <type 'list'> now containing: ['xyz', 'abc', 'abc', 'zzq', 'ak9', 'ak9', 'ak9', 'foo', 'abc']
Does abc match xyz?
It doesn't. Creating a new 'bucket'
Does abc match abc?
It does. Aggregating
Does zzq match abc?
It doesn't. Creating a new 'bucket'
Does ak9 match zzq?
It doesn't. Creating a new 'bucket'
Does ak9 match ak9?
It does. Aggregating
Does ak9 match ak9?
It does. Aggregating
Does foo match ak9?
It doesn't. Creating a new 'bucket'
Does abc match foo?
It doesn't. Creating a new 'bucket'
Collected occurrences: [['xyz', 1], ['abc', 2], ['zzq', 1], ['ak9', 3], ['foo', 1], ['abc', 1]]
Compressed string: 'xyz abc*2 zzq ak9*3 foo abc '

(Note the trailing whitespace.)

The following turned out to be quite robust and performs well in my application. Thanks to BorrajaX's answer for pointing out the benefit of not scanning the input string more often than necessary.

The function below also preserves newlines and whitespace in the output. I forgot to specify that in my question, but it turns out to be desirable in my application, which needs to produce some human-readable intermediate output.

import re

def gather_token_sequences(masterptn, target):
    """
    Find all sequences in 'target' of two or more identical adjacent tokens
    that match 'masterptn'.  Count the number of tokens in each sequence.
    Return a new version of 'target' with each sequence replaced by one token
    suffixed with '*N' where N is the count of tokens in the sequence.
    Whitespace in the input is preserved (except where consumed within replaced
    sequences).

    >>> mptn = r'ab\w'
    >>> tgt = 'foo abc abc'
    >>> gather_token_sequences(mptn, tgt)
    'foo abc*2'

    >>> tgt = 'abc abc '
    >>> gather_token_sequences(mptn, tgt)
    'abc*2 '

    >>> tgt = '\\nabc\\nabc abc\\ndef\\nxyz abx\\nabx\\nxxx abc'
    >>> gather_token_sequences(mptn, tgt)
    '\\nabc*3\\ndef\\nxyz abx*2\\nxxx abc'
    """

    # Emulate python's strip() function except that the leading and trailing
    # whitespace are captured for final output. This guarantees that the
    # body of the remaining string will start and end with a token, which
    # slightly simplifies the subsequent matching loops.
    stripped = re.match(r'^(\s*)(\S.*\S)(\s*)$', target, flags=re.DOTALL)
    head, body, tail = stripped.groups()

    # Init the result list and loop variables.
    result = [head]
    i = 0
    token = None
    while i < len(body):
        ## try to match master pattern
        match = re.match(masterptn, body[i:])
        if match is None:
            ## Append char and advance.
            result.append(body[i])
            i += 1

        else:
            ## Start new token sequence
            token = match.group(0)
            esc = re.escape(token) # might have special chars in token
            ptn = r"((?:{}\s+)+{})".format(esc, esc)
            seq = re.match(ptn, body[i:])
            if seq is None: # token is not repeated.
                result.append(token)
                i += len(token)
            else:
                seqstring = seq.group(0)
                replacement = "{}*{}".format(token, seqstring.count(token))
                result.append(replacement)
                i += len(seq.group(0))

    result.append(tail)
    return ''.join(result)
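
A short usage example (added here for illustration; it is not part of the original answer, and the result in the comment is from hand-tracing the function on the question's edited example):

s = "xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc"
print(gather_token_sequences(r"a\w\w", s))
# -> xyz xyz abc*2 zzq ak9*3 foo foo abc

Tokens that do not match the pattern (xyz, foo) are left alone even when repeated, which is the behaviour the edited question asks for.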

Do you need to use a regexp?
@BorrajaX Yes, I think so. The actual patterns to be matched are LilyPond full-measure rests, which look like "R1*3/4" or "s1*13/16". The project is on GitHub if you're curious. I should have made it clear in my question that the input string may contain repeated tokens that should not be aggregated; I'll edit the question to make that more obvious.
Thanks for taking a shot at this. Your solution would work well if the equality test were combined with a test of whether the regex matches the current word. I'm upvoting, and I'll accept it in a day or so unless someone comes up with a solution that sinks both of our approaches...
I don't see the test :-)
Yes, yes. You can always answer your own question and select your own answer as the accepted one. Maybe that would help other readers? Up to you :-)
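
The comments above mention that the real tokens are LilyPond full-measure rests such as "R1*3/4" or "s1*13/16". As a hypothetical illustration (the pattern below is my own guess, not taken from the project), the accepted function handles the '*' and '/' inside such tokens because it escapes the matched token with re.escape before building the repetition regex:

rest_ptn = r"[Rs]1\*\d+/\d+"   # assumed pattern for full-measure rests
print(gather_token_sequences(rest_ptn, "R1*3/4 R1*3/4 s1*13/16 c'4"))
# -> R1*3/4*2 s1*13/16 c'4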