Python: replacing adjacent identical tokens that match a regular expression


In a Python application I need to replace adjacent, identical, whitespace-separated tokens that match a regular expression, e.g. for a pattern like "a\w\w".

EDIT

My example above did not make it clear that tokens which do not match the regex should not be aggregated. A better example is

"xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc" 
--> "xyz xyz abc*2 zzq ak9*3 foo foo bar abc"
END EDIT

I have working code posted below, but it seems more complicated than it ought to be.

I'm not looking for a round of code golf, but I am interested in a more readable solution, using the standard Python libraries, with comparable performance.

In my application it is safe to assume that input strings are less than 10,000 characters long and that any given string contains only a handful (say, fewer than 10) of distinct strings matching the pattern.

import re

def fm_pattern_factory(ptnstring):
    """
    Return a regex that matches two or more occurrences 
    of ptnstring separated by whitespace.
    >>> fm_pattern_factory('abc').match(' abc abc ') is None
    False
    >>> fm_pattern_factory('abc').match('abc') is None
    True
    """
    ptn = r"\s*({}(?:\s+{})+)\s*".format(ptnstring, ptnstring)
    return re.compile(ptn)

def fm_gather(target, ptnstring):
    """
    Replace adjacent occurrences of ptnstring in target with
    ptnstring*N where N is the number of occurrences.
    >>> fm_gather('xyz abc abc def abc', 'abc')
    'xyz abc*2 def abc'
    >>> fm_gather('xyz abc abc def abc abc abc qrs', 'abc')
    'xyz abc*2 def abc*3 qrs'
    """
    ptn = fm_pattern_factory(ptnstring)
    result = []
    index = 0
    for match in ptn.finditer(target):
        result.append(target[index:match.start()+1])
        repl = "{}*{}".format(ptnstring, match.group(1).count(ptnstring))
        result.append(repl)
        index = match.end() - 1

    result.append(target[index:])
    return "".join(result)

def fm_gather_all(target, ptn):
    """ 
    Apply fm_gather() to all distinct matches for ptn.
    >>> s = "x abc abc y abx abx z acq"
    >>> ptn = re.compile(r"a..")
    >>> fm_gather_all(s, ptn)
    'x abc*2 y abx*2 z acq'
    """
    ptns = set(ptn.findall(target))
    for p in ptns:
        target = fm_gather(target, p)
    return "".join(target)

Sorry, I was putting an answer together before I saw your first comment. If this doesn't answer your question, let me know and I'll delete it or try to adapt it accordingly.

For the simple input provided in the question (what is stored in the my_string variable in the code below) you can try a different approach: walk through the input list and keep a 'bucket' for each run of identical tokens.
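
A minimal sketch of that bucket approach (my reconstruction; the names my_string and collected are assumptions, and the prints are modeled on the transcript below):

my_string = "xyz abc abc zzq ak9 ak9 ak9 foo abc"

my_splitted_string = my_string.split()
print("my_splitted_string is a %s now containing: %s"
      % (type(my_splitted_string), my_splitted_string))

collected = []  # list of [token, count] "buckets"
for token in my_splitted_string:
    if collected:
        print("Does %s match %s?" % (token, collected[-1][0]))
    if collected and collected[-1][0] == token:
        print("It does. Aggregating")
        collected[-1][1] += 1
    else:
        if collected:
            print("It doesn't. Creating a new 'bucket'")
        collected.append([token, 1])

print("Collected occurrences: %s" % collected)

# Rebuild the string, appending "*N" only when a token was repeated.
compressed = ""
for token, count in collected:
    compressed += token if count == 1 else "%s*%s" % (token, count)
    compressed += " "
print("Compressed string: %r" % compressed)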

This produces:

my_splitted_string is a <type 'list'> now containing: ['xyz', 'abc', 'abc', 'zzq', 'ak9', 'ak9', 'ak9', 'foo', 'abc']
Does abc match xyz?
It doesn't. Creating a new 'bucket'
Does abc match abc?
It does. Aggregating
Does zzq match abc?
It doesn't. Creating a new 'bucket'
Does ak9 match zzq?
It doesn't. Creating a new 'bucket'
Does ak9 match ak9?
It does. Aggregating
Does ak9 match ak9?
It does. Aggregating
Does foo match ak9?
It doesn't. Creating a new 'bucket'
Does abc match foo?
It doesn't. Creating a new 'bucket'
Collected occurrences: [['xyz', 1], ['abc', 2], ['zzq', 1], ['ak9', 3], ['foo', 1], ['abc', 1]]
Compressed string: 'xyz abc*2 zzq ak9*3 foo abc '

(Note the trailing whitespace.)

The following turned out to be quite robust and performs well in my application. Thanks to BorrajaX's answer for pointing out the benefit of not scanning the input string more often than necessary.

The function below also preserves newlines and whitespace in the output. I forgot to specify that in my question, but it turns out to be desirable in my application, which needs to produce some human-readable intermediate output.

import re

def gather_token_sequences(masterptn, target):
    """
    Find all sequences in 'target' of two or more identical adjacent tokens
    that match 'masterptn'.  Count the number of tokens in each sequence.
    Return a new version of 'target' with each sequence replaced by one token
    suffixed with '*N' where N is the count of tokens in the sequence.
    Whitespace in the input is preserved (except where consumed within replaced
    sequences).

    >>> mptn = r'ab\w'
    >>> tgt = 'foo abc abc'
    >>> gather_token_sequences(mptn, tgt)
    'foo abc*2'

    >>> tgt = 'abc abc '
    >>> gather_token_sequences(mptn, tgt)
    'abc*2 '

    >>> tgt = '\\nabc\\nabc abc\\ndef\\nxyz abx\\nabx\\nxxx abc'
    >>> gather_token_sequences(mptn, tgt)
    '\\nabc*3\\ndef\\nxyz abx*2\\nxxx abc'
    """

    # Emulate python's strip() function except that the leading and trailing
    # whitespace are captured for final output. This guarantees that the
    # body of the remaining string will start and end with a token, which
    # slightly simplifies the subsequent matching loops.
    stripped = re.match(r'^(\s*)(\S.*\S)(\s*)$', target, flags=re.DOTALL)
    head, body, tail = stripped.groups()

    # Init the result list and loop variables.
    result = [head]
    i = 0
    token = None
    while i < len(body):
        ## try to match master pattern
        match = re.match(masterptn, body[i:])
        if match is None:
            ## Append char and advance.
            result.append(body[i])
            i += 1

        else:
            ## Start new token sequence
            token = match.group(0)
            esc = re.escape(token) # might have special chars in token
            ptn = r"((?:{}\s+)+{})".format(esc, esc)
            seq = re.match(ptn, body[i:])
            if seq is None: # token is not repeated.
                result.append(token)
                i += len(token)
            else:
                seqstring = seq.group(0)
                replacement = "{}*{}".format(token, seqstring.count(token))
                result.append(replacement)
                i += len(seq.group(0))

    result.append(tail)
    return ''.join(result)
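
A short usage example (added here for illustration; it is not part of the original answer, and the result in the comment is from hand-tracing the function on the question's edited example):

s = "xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc"
print(gather_token_sequences(r"a\w\w", s))
# -> xyz xyz abc*2 zzq ak9*3 foo foo abc

Tokens that do not match the pattern (xyz, foo) are left alone even when repeated, which is the behaviour the edited question asks for.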

Do you need to use a regexp?
@BorrajaX Yes, I think so. The actual patterns to be matched are LilyPond full-measure rests, which look like "R1*3/4" or "s1*13/16". The project is on GitHub if you're curious. I should have made it clear in my question that the input string may contain repeated tokens that should not be aggregated; I'll edit the question to make that more obvious.
Thanks for taking a shot at this. Your solution would work well if the equality test were combined with a test of whether the regex matches the current word. I'm upvoting, and I'll accept it in a day or so unless someone comes up with a solution that sinks both of our approaches...
I don't see the test :-)
Yes, yes. You can always answer your own question and select your own answer as the accepted one. Maybe that would help other readers? Up to you :-)
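
The comments above mention that the real tokens are LilyPond full-measure rests such as "R1*3/4" or "s1*13/16". As a hypothetical illustration (the pattern below is my own guess, not taken from the project), the accepted function handles the '*' and '/' inside such tokens because it escapes the matched token with re.escape before building the repetition regex:

rest_ptn = r"[Rs]1\*\d+/\d+"   # assumed pattern for full-measure rests
print(gather_token_sequences(rest_ptn, "R1*3/4 R1*3/4 s1*13/16 c'4"))
# -> R1*3/4*2 s1*13/16 c'4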