Python: splitting bracket-delimited text that may contain quoted strings

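I'm trying to split some text. Basically, I want to split on the first-level brackets, e.g.

"('1','a',NULL),(2,'b')"

=>

["('1','a',NULL)", "(2,'b')"]

but I need to account for quoted strings that may appear inside the brackets. The solution must satisfy at least the following py.test tests: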

from splitter import split_text


def test_normal():
    assert split_text("('1'),('2')") == ["('1')", "('2')"]
    assert split_text("(1),(2),(3)") == ["(1)", "(2)", "(3)"]


def test_complex():
    assert split_text("('1','a'),('2','b')") == ["('1','a')", "('2','b')"]
    assert split_text("('1','a',NULL),(2,'b')") == ["('1','a',NULL)", "(2,'b')"]


def test_apostrophe():
    assert split_text("('\\'1','a'),('2','b')") == ["('\\'1','a')", "('2','b')"]


def test_coma_in_string():
    assert split_text("('1','a,c'),('2','b')") == ["('1','a,c')", "('2','b')"]


def test_bracket_in_string():
    assert split_text("('1','a)c'),('2','b')") == ["('1','a)c')", "('2','b')"]


def test_bracket_and_coma_in_string():
    assert split_text("('1','a),(c'),('2','b')") == ["('1','a),(c')", "('2','b')"]


def test_bracket_and_coma_in_string_apostrophe():
    assert split_text("('1','a\\'),(c'),('2','b')") == ["('1','a\\'),(c')", "('2','b')"]

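I tried the following:

1) Regular expressions
This looks like it should be the best solution, but unfortunately I did not find anything that satisfies all the tests.
My best attempt is:

import re


def split_text(text):
    return re.split(r'(?<=\)),(?=\()', text)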
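As a quick illustration of where this attempt falls short, the sketch below runs the same lookaround split on one simple input and on the input from test_bracket_and_coma_in_string. The function name split_naive is only a label chosen for this example.

```python
import re


def split_naive(text):
    # Split on every comma that sits directly between ")" and "(".
    # The pattern has no notion of quoting, so a "),(" sequence inside
    # a quoted string is treated as a separator too.
    return re.split(r'(?<=\)),(?=\()', text)


simple = split_naive("('1','a'),('2','b')")
tricky = split_naive("('1','a),(c'),('2','b')")

print(simple)  # two groups, as wanted
print(tricky)  # three pieces: the quoted string 'a),(c' has been cut in half
```

The second call yields "('1','a)", "(c')" and "('2','b')", which is why any working approach has to keep track of whether it is currently inside a quoted string.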
2) A finite-state-machine-style solution
It works correctly and passes all the tests, but it is very slow.

3) pyparsing

from pyparsing import QuotedString, ZeroOrMore, Literal, Group, Suppress, Word, nums

null_value = Literal('NULL')
number_value = Word(nums)
string_value = QuotedString("'", escChar='\\', unquoteResults=False)
value = null_value | number_value | string_value
one_bracket = Group(Literal('(') + value + ZeroOrMore(Literal(',') + value) + Literal(')'))
all_brackets = one_bracket + ZeroOrMore(Suppress(',') + one_bracket)


def split_text(text):
    parse_result = all_brackets.parseString(text)
    return [''.join(a) for a in parse_result]
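It also passes all the tests but, surprisingly, it is even slower than solution 2.

How can I make the solution both fast and robust? I have a feeling I'm missing something obvious.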
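Since the question turns on speed, one rough way to compare candidate implementations is the standard-library timeit module. The sketch below is only an assumption about how such a measurement could be set up: the benchmark helper and the sample input are made up for illustration and are not the data used in the question.

```python
import re
import timeit


def split_naive(text):
    # Solution 1 from the question: quick, but wrong when "),(" occurs
    # inside a quoted string. Used here only as something to time.
    return re.split(r'(?<=\)),(?=\()', text)


def benchmark(func, text, repeat=5, number=100):
    """Return the best observed time per call, in seconds."""
    timer = timeit.Timer(lambda: func(text))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Hypothetical sample input: 10,000 simple groups joined by commas.
sample = ",".join("({0},'row {0}')".format(i) for i in range(10000))

if __name__ == "__main__":
    print("naive regex: {:.6f} s per call".format(benchmark(split_naive, sample)))
```

Passing each candidate split_text to the same helper with the same input gives numbers that can be compared directly; taking the minimum over several repeats reduces noise from other processes.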
I made this, and it works on the given tests:

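tests = ["('1'),('2')",
         "(1),(2),(3)",
         "('1','a'),('2','b')",
         "('1','a',NULL),(2,'b')",
         "('\\'1','a'),('2','b')",
         "('1','a,c'),('2','b')",
         "('1','a)c'),('2','b')",
         "('1','a),(c'),('2','b')",
         "('1','a\\'),(c'),('2','b')"]

for text in tests:
    tmp = ''          # characters of the group being collected
    res = []          # finished groups
    bracket = 0       # current bracket depth outside quotes
    quote = False     # True while inside a quoted string
    for idx, i in enumerate(text):
        if i == "'":
            # A quote toggles the state unless a backslash precedes it.
            if text[idx-1] != '\\':
                quote = not quote
            tmp += i
        elif quote:
            tmp += i
        elif i == ',':
            # Commas between top-level groups are dropped.
            if bracket:
                tmp += i
        else:
            if i == '(':
                bracket += 1
            elif i == ')':
                bracket -= 1
            if bracket:
                tmp += i
            else:
                # Depth is back to zero: the group is complete.
                tmp += i
                res.append(tmp)
                tmp = ''
    print(res)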
Output:

["('1')", "('2')"]
['(1)', '(2)', '(3)']
["('1','a')", "('2','b')"]
["('1','a',NULL)", "(2,'b')"]
["('\\'1','a')", "('2','b')"]
["('1','a,c')", "('2','b')"]
["('1','a)c')", "('2','b')"]
["('1','a),(c')", "('2','b')"]
["('1','a\\'),(c')", "('2','b')"]

The code has room for improvement; edits are welcome. :)
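The loop above can also be packaged as the split_text function that the tests import. The following is a minimal sketch of that idea, not the answerer's code: it slices the original string instead of building each group character by character, and it tracks escapes with an explicit flag (the names in_quote, escaped and depth are choices made for this example). It assumes well-formed input with balanced brackets and single-quoted strings that use backslash escapes.

```python
def split_text(text):
    """Split "(...),(...)" into its top-level bracket groups."""
    groups = []
    start = 0          # index where the current group began
    depth = 0          # bracket nesting depth outside quotes
    in_quote = False   # True while inside a single-quoted string
    escaped = False    # True right after a backslash inside a string
    for idx, ch in enumerate(text):
        if in_quote:
            if escaped:
                escaped = False      # this character is escaped; ignore it
            elif ch == '\\':
                escaped = True
            elif ch == "'":
                in_quote = False
        elif ch == "'":
            in_quote = True
        elif ch == '(':
            if depth == 0:
                start = idx
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:idx + 1])
    return groups
```

Slicing avoids repeated string concatenation, and the escaped flag also copes with an escaped backslash directly before a closing quote, which a look at only the previous character would misread.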
One option is the newer regex module, which supports the (*SKIP)(*FAIL) verbs:

import regex as re


def split_text(text):
    rx = r"""'.*?(?<!\\)'(*SKIP)(*FAIL)|(?<=\)),(?=\()"""
    return re.split(rx, text)

The (*SKIP)(*FAIL) pattern above, broken down:

'.*?(?<!\\)'       # look for a single quote up to a new single quote
                   # that MUST NOT be escaped (thus the neg. lookbehind)
(*SKIP)(*FAIL)|    # these parts shall fail
(?<=\)),(?=\()     # your initial pattern with a positive lookbehind/ahead

Here is a regular expression that seems to work and passes all the tests. Run on real data, it is about 6 times faster than the finite-state machine implemented in Python.

import re

PATTERN = re.compile(
    r"""
        \(  # Opening bracket
            (?:
                # String
                (?:'(?:
                       (?:\\')|[^']  # Either escaped apostrophe, or other character
                       )*'
                )
                |
                # or other literal not containing right bracket
                [^')]
            )
            (?:, # Zero or more of them separated with comma following the first one
                # String
                (?:'(?:
                       (?:\\')|[^']  # Either escaped apostrophe, or other character
                       )*'
                )
                |
                # or other literal
                [^')]
            )*
        \)  # Closing bracket
    """,
    re.VERBOSE)


def split_text(text):
    return PATTERN.findall(text)

Do you think it will be faster than solutions 2 and 3?

So I think it is more or less an FSM; on real data its speed is comparable to solution 2.

The solution is O(n) for each string, where n is the length of the string. So yes, it is as fast as it gets, since you need to scan the string at least once. Compared with solution #2, I think mine is shorter. In approach they are similar, and both are O(n). Any difference in running time is probably down to implementation details. By the way, can you share how you decided that an algorithm is slow?

Interesting, I didn't know this library. For a few reasons I can't use it, but I will keep an eye on it. Thanks! +1
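For cases where the third-party regex module is not available, the idea behind (*SKIP)(*FAIL) can be approximated with the standard-library re module. This is a different technique from the one in that answer: one alternative consumes quoted strings whole so that nothing inside them is ever examined, and the other alternative captures a real separator. The sketch below is an illustration under that assumption; the name TOKEN and the manual slicing loop are choices made for this example.

```python
import re

# Alternative 1 consumes a whole single-quoted string, including
# backslash escapes, without capturing anything.
# Alternative 2 captures a comma that sits directly between ")" and "(".
TOKEN = re.compile(r"""'(?:\\.|[^'\\])*'|(?<=\))(,)(?=\()""")


def split_text(text):
    parts = []
    last = 0
    for match in TOKEN.finditer(text):
        # Group 1 is set only for a real separator; matches of the
        # quoted-string alternative are skipped.
        if match.group(1) is not None:
            parts.append(text[last:match.start()])
            last = match.end()
    parts.append(text[last:])
    return parts
```

Because the scanner always resumes after a complete quoted string, a "),(" sequence inside quotes can never be reached by the second alternative. Whether this is faster than the findall pattern above would need to be measured on real data.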