Python regex: replace text unless it is between quotes


I am working on a transpiler and want to replace the tokens of my language with Python tokens. The replacement is done like this:

for rep in reps:
    pattern, translated = rep

    # Replaces every [pattern] with [translated] in [transpiled]
    transpiled = re.sub(pattern, translated, transpiled, flags=re.UNICODE)
where reps is a list of (regex to match, replacement string) ordered pairs and transpiled is the text to be transpiled.


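However, I can't seem to find a way to exclude text between quotes from the replacement process. Note that this is for a programming language, so it also has to handle escaped quotes and single quotes.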
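To make the problem concrete, here is a minimal sketch of the loop from the question run on made-up input. The sample pattern and source text are assumptions chosen for illustration, not taken from the actual transpiler:

```python
import re

# Hypothetical sample data: one replacement rule and one line of source text.
reps = [(r"\bfoo\b", "bar")]
transpiled = 'foo = "foo"'

for pattern, translated in reps:
    # Replaces every match, including the one inside the string literal.
    transpiled = re.sub(pattern, translated, transpiled, flags=re.UNICODE)

print(transpiled)  # bar = "bar"
```

The second `foo` sits inside a string literal and should have been left alone, which is exactly what the question is asking about.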

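This may depend on how you define your patterns, but in general you can always surround your pattern with lookahead and lookbehind groups so that a match directly adjacent to a quote character is skipped: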

import re

transpiled = "A foo with \"foo\" and single quoted 'foo'. It even has an escaped \\'foo\\'!"

reps = [("foo", "bar"), ("and", "or")]

print(transpiled)  # before the changes

for rep in reps:
    pattern, translated = rep
    transpiled = re.sub("(?<=[^\"']){}(?=\\\\?[^\"'])".format(pattern),
                        translated, transpiled, flags=re.UNICODE)
    print(transpiled)  # after each change
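As a bonus, this also lets you use full regular expressions as the replacement patterns (the call below assumes a replace_non_quoted() whose replace_multiple() helper calls re.sub()):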

print(replace_non_quoted("My foo and \"bar\" are like 'moo' and star!",
                        ((r"(\w+)oo", r"oo\1"), (r"(\w+)ar", r"ra\1"))))
# My oof and "bar" are like 'moo' and rast!
But if your replacements don't involve patterns and only need plain substitution, you can swap re.sub() in the replace_multiple() helper function for the significantly faster native str.replace().
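For reference, a regex-based version of the helper might look like the sketch below. The name replace_multiple_regex() is hypothetical and was chosen here to keep it apart from the str.replace() version; the original regex-based helper is not shown in this thread:

```python
import re

def replace_multiple_regex(source, replacements):
    """Apply each (pattern, replacement) pair in order using re.sub()."""
    if not source:  # no need to process empty strings
        return ""
    for pattern, replacement in replacements:
        source = re.sub(pattern, replacement, source, flags=re.UNICODE)
    return source

print(replace_multiple_regex("My foo and star!",
                             ((r"(\w+)oo", r"oo\1"), (r"(\w+)ar", r"ra\1"))))
# My oof and rast!
```

Raw strings keep sequences such as `\w` and `\1` from being read as string escapes.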

Finally, if you don't need complex patterns, you can get rid of regex altogether:

QUOTE_STRINGS = ("'", "\\'", '"', '\\"')  # a list of substring considered a 'quote'

def replace_multiple(source, replacements):  # a convenience multi-replacement function
    if not source:  # no need to process empty strings
        return ""
    for r in replacements:
        source = source.replace(r[0], r[1])
    return source

def replace_non_quoted(source, replacements):
    result = []  # a store for the result pieces
    head = 0  # a search head reference
    eos = len(source)  # a convenience string length reference
    quote = None  # last quote match literal
    quote_len = 0  # a convenience reference to the current quote substring length
    while True:
        if quote:  # we already have a matching quote stored
            index = source.find(quote, head + quote_len)  # find the closing quote
            if index == -1:  # EOS reached
                break
            result.append(source[head:index + quote_len])  # add the quoted string verbatim
            head = index + quote_len  # move the search head after the quoted match
            quote = None  # blank out the quote literal
        else:  # the current position is not in a quoted substring
            index = eos
            # find the first quoted substring from the current head position
            for entry in QUOTE_STRINGS:  # loop through all quote substrings
                candidate = source.find(entry, head)
                if head <= candidate < index:  # <= so a quote right at the search head is not skipped
                    index = candidate
                    quote = entry
                    quote_len = len(entry)
            if not quote:  # EOS reached, no quote found
                break
            result.append(replace_multiple(source[head:index], replacements))
            head = index  # move the search head to the start of the quoted match
    if head < eos:  # if the search head is not at the end of the string
        result.append(replace_multiple(source[head:], replacements))
    return "".join(result)  # join back the result pieces and return them
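The scanner above treats an escaped quote as a delimiter of its own. If escaped quotes inside a quoted string should stay part of that string, one alternative (a different technique from the answer above) is to let a single regex match either a whole quoted string or the pattern, and only replace the latter. This is a sketch under several assumptions: the name replace_outside_quotes() is hypothetical, the replacement is treated as literal text with no backreferences, and quotes outside string literals are assumed to be unescaped and balanced:

```python
import re

# A quoted string: an opening quote, then escaped characters or
# non-quote characters, then the same closing quote.
QUOTED = r"""(?P<quoted>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')"""

def replace_outside_quotes(source, pattern, replacement):
    # Quoted strings are tried first, so they consume their contents
    # before the user pattern gets a chance to match inside them.
    combined = re.compile(QUOTED + "|(?:" + pattern + ")", flags=re.UNICODE)

    def substitute(match):
        if match.group("quoted") is not None:
            return match.group(0)  # keep quoted text verbatim
        return replacement  # literal replacement text

    return combined.sub(substitute, source)

source = r'''foo "foo \"foo\"" and 'foo' foo'''
print(replace_outside_quotes(source, r"foo", "bar"))
# bar "foo \"foo\"" and 'foo' bar
```

To apply several rules, call the function once per (pattern, replacement) pair, as in the loop from the question.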
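Rather than relying on regular expressions alone, you may want to use Python's built-in shlex module. It is designed to handle quoted strings the way a shell does, including nested examples: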

import shlex
shlex.split("""look "nested \\"quotes\\"" here""")
# ['look', 'nested "quotes"', 'here']
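One way shlex could be applied to the replacement problem is sketched below. It assumes non-POSIX mode, where shlex.split() keeps the quote characters on quoted tokens, so those tokens can be recognised and skipped. The function name replace_unquoted_tokens() is hypothetical. Note that this approach collapses runs of whitespace into single spaces and that shlex raises ValueError on an unclosed quote, so it may not suit whitespace-sensitive source text:

```python
import re
import shlex

def replace_unquoted_tokens(source, replacements):
    # posix=False keeps the surrounding quotes on quoted tokens.
    tokens = shlex.split(source, posix=False)
    result = []
    for token in tokens:
        if token[:1] in ("'", '"'):  # quoted token: keep verbatim
            result.append(token)
            continue
        for pattern, translated in replacements:
            token = re.sub(pattern, translated, token, flags=re.UNICODE)
        result.append(token)
    return " ".join(result)

print(replace_unquoted_tokens('look foo "nested foo" here', [("foo", "bar")]))
# look bar "nested foo" here
```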

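I think I know what you mean, but just to be sure, could you provide a sample input with the current and the expected output? That usually makes a question much easier to answer.

Why not add lookahead/lookbehind groups built around a quote character class to your regex pattern?

^ This may be what you want.

Thanks, but it doesn't seem to work. For example:

import re

transpiled = ('hey "foo and foo!"')

reps = [("foo!", "bar"), ("and", "or")]

print(transpiled)  # before the changes
for rep in reps:
    pattern, translated = rep
    transpiled = re.sub((?...

Thanks again for all the time and effort you have put in. Unfortunately, this solution still doesn't seem to work: the (slightly modified) code produces TypeError: sequence item 0: expected string, NoneType found. I have been at this for quite a while, so your help really matters. Thanks in advance!

@Lucca - you never return source from the replace_multiple() function. Also, keep QUOTE_STRINGS outside the function so that you don't have to rebuild it every time you run the pattern.

Yes! Thank you very much!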