Python: splitting bracket-delimited text that can contain quoted strings
Tags: python, regex, pyparsing, state-machine

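I'm trying to split some text. Basically, I want to separate the top-level brackets, e.g.
"('1','a',NULL),(2,'b')"
=>
["('1','a',NULL)", "(2,'b')"]
but I need to account for quoted strings that may appear inside. It has to satisfy at least the following py.test tests: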

from splitter import split_text


def test_normal():
    assert split_text("('1'),('2')") == ["('1')", "('2')"]
    assert split_text("(1),(2),(3)") == ["(1)", "(2)", "(3)"]


def test_complex():
    assert split_text("('1','a'),('2','b')") == ["('1','a')", "('2','b')"]
    assert split_text("('1','a',NULL),(2,'b')") == ["('1','a',NULL)", "(2,'b')"]


def test_apostrophe():
    assert split_text("('\\'1','a'),('2','b')") == ["('\\'1','a')", "('2','b')"]


def test_coma_in_string():
    assert split_text("('1','a,c'),('2','b')") == ["('1','a,c')", "('2','b')"]


def test_bracket_in_string():
    assert split_text("('1','a)c'),('2','b')") == ["('1','a)c')", "('2','b')"]


def test_bracket_and_coma_in_string():
    assert split_text("('1','a),(c'),('2','b')") == ["('1','a),(c')", "('2','b')"]


def test_bracket_and_coma_in_string_apostrophe():
    assert split_text("('1','a\\'),(c'),('2','b')") == ["('1','a\\'),(c')", "('2','b')"]
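I tried the following:

1) Regular expressions

This looks like the best solution, but unfortunately I did not find a pattern that satisfies all the tests.

My best attempt is: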

import re

def split_text(text):
    return re.split(r'(?<=\)),(?=\()', text)
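To see where this attempt breaks, here is a small sketch that runs the same pattern on one plain input and on one input whose quoted string contains `),(`. Both inputs are taken from the tests above; nothing else is assumed.

```python
import re


def split_text(text):
    # Split on a comma that has ")" directly before it and "(" directly after it.
    return re.split(r'(?<=\)),(?=\()', text)


# Plain case: the only "),(" is the real separator.
print(split_text("('1','a'),('2','b')"))
# -> ["('1','a')", "('2','b')"]

# A quoted string containing "),(" is split as well, which is wrong.
print(split_text("('1','a),(c'),('2','b')"))
# -> ["('1','a)", "(c')", "('2','b')"]
```

The lookarounds only inspect the characters next to the comma, so they cannot tell whether that comma sits inside a quoted string.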
2) A finite state machine written in Python

It works and passes all the tests, but it is very slow.

3) pyparsing

from pyparsing import QuotedString, ZeroOrMore, Literal, Group, Suppress, Word, nums

null_value = Literal('NULL')
number_value = Word(nums)
string_value = QuotedString("'", escChar='\\', unquoteResults=False)
value = null_value | number_value | string_value
one_bracket = Group(Literal('(') + value + ZeroOrMore(Literal(',') + value) + Literal(')'))
all_brackets = one_bracket + ZeroOrMore(Suppress(',') + one_bracket)


def split_text(text):
    parse_result = all_brackets.parseString(text)
    return [''.join(a) for a in parse_result]
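It also passes all the tests, but surprisingly it is even slower than solution 2.

How can I make the solution both fast and robust? I have a feeling that I'm missing something obvious.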

I made this, and it works on the given tests:

tests = ["('1'),('2')",
"(1),(2),(3)",
"('1','a'),('2','b')",
"('1','a',NULL),(2,'b')",
"('\\'1','a'),('2','b')",
"('1','a,c'),('2','b')",
"('1','a)c'),('2','b')",
"('1','a),(c'),('2','b')",
"('1','a\\'),(c'),('2','b')"]

for text in tests:
    tmp = ''
    res = []
    bracket = 0
    quote = False

    for idx,i in enumerate(text):
        if i=="'":
            if text[idx-1]!='\\':
                quote = not quote
            tmp += i
        elif quote:
            tmp += i
        elif i==',':
            if bracket: tmp += i
            else:   pass
        else:
            if i=='(':      bracket += 1
            elif i==')':    bracket -= 1

            if bracket:   tmp += i
            else:
                tmp += i
                res.append(tmp)
                tmp = ''

    print(res)
Output:

["('1')", "('2')"]
['(1)', '(2)', '(3)']
["('1','a')", "('2','b')"]
["('1','a',NULL)", "(2,'b')"]
["('\\'1','a')", "('2','b')"]
["('1','a,c')", "('2','b')"]
["('1','a)c')", "('2','b')"]
["('1','a),(c')", "('2','b')"]
["('1','a\\'),(c')", "('2','b')"]

There is still room to improve the code; edits are welcome. :)
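One possible cleanup, offered as a sketch rather than a definitive version: wrap the loop in a `split_text` function, collect characters in a list instead of concatenating strings, and track escapes with an explicit flag instead of looking at the previous character. The `escaped` flag and the variable names are illustrative choices, not part of the code above.

```python
def split_text(text):
    """Split "(..),(..)" into its top-level bracket groups.

    Quoted strings use single quotes and a backslash as the escape character.
    """
    groups = []        # finished bracket groups
    current = []       # characters of the group being built
    depth = 0          # current bracket nesting level
    in_quote = False   # True while inside a '...' string
    escaped = False    # True right after a backslash inside a string

    for ch in text:
        if in_quote:
            current.append(ch)
            if escaped:
                escaped = False      # this character was escaped; take it as-is
            elif ch == '\\':
                escaped = True       # the next character is escaped
            elif ch == "'":
                in_quote = False     # unescaped quote closes the string
        elif ch == "'":
            in_quote = True
            current.append(ch)
        elif ch == '(':
            depth += 1
            current.append(ch)
        elif ch == ')':
            depth -= 1
            current.append(ch)
            if depth == 0:           # a top-level group just closed
                groups.append(''.join(current))
                current = []
        elif depth > 0:
            current.append(ch)
        # anything else is a separator between top-level groups and is dropped

    return groups


print(split_text("('1','a\\'),(c'),('2','b')"))
```

Because the flag is reset after every escaped character, a string that ends in an escaped backslash is closed correctly by the quote that follows it. A check on the previous character alone would treat that quote as escaped.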

One way is to use the newer regex module, which supports the (*SKIP)(*FAIL) verbs:

import regex as re

def split_text(text):
    rx = r"""'.*?(?<!\\)'(*SKIP)(*FAIL)|(?<=\)),(?=\()"""
    return re.split(rx, text)
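If the third-party module is not an option, the same "skip over quoted strings" idea can be approximated with the standard `re` module. The sketch below is an assumption-laden variant, not the code above: it lets the pattern match either a whole quoted string or a separator comma, and splits only at the commas. The name `TOKEN` and the quoted-string sub-pattern are my own.

```python
import re

# Alternative 1 consumes a whole quoted string (backslash escapes included),
# so nothing inside it can be matched on its own.
# Alternative 2 is the separator: a comma between ")" and "(".
TOKEN = re.compile(r"'(?:[^'\\]|\\.)*'|(?<=\)),(?=\()")


def split_text(text):
    parts = []
    start = 0
    for match in TOKEN.finditer(text):
        # A quoted string always starts with "'", so a match equal to ","
        # can only be the separator alternative.
        if match.group() == ',':
            parts.append(text[start:match.start()])
            start = match.end()
    parts.append(text[start:])  # the last group has no trailing separator
    return parts


print(split_text("('1','a),(c'),('2','b')"))
# -> ["('1','a),(c')", "('2','b')"]
```

Matching the quoted strings first plays the role of (*SKIP)(*FAIL): those spans are consumed and discarded, so a `),(` inside them is never seen as a separator.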
Here is a regular expression that seems to work and passes all the tests. Run on real data, it is about 6 times faster than the finite state machine implemented in Python.

import re

PATTERN = re.compile(
    r"""
        \(  # Opening bracket

            (?:

            # String
            (?:'(?:
               (?:\\')|[^']  # Either escaped apostrophe, or other character
               )*'
            )
            |
            # or other literal not containing right bracket
            [^')]

            )

            (?:, # Zero or more of them separated with comma following the first one

            # String
            (?:'(?:
               (?:\\')|[^']  # Either escaped apostrophe, or other character
               )*'
            )
            |
            # or other literal
            [^')]

            )*

        \)  # Closing bracket
    """,
    re.VERBOSE)


def split_text(text):
    return PATTERN.findall(text)
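Speed claims like the one above are easiest to check with `timeit`. The following is a minimal sketch of such a measurement, not the benchmark actually used: the helper `seconds_per_call`, the synthetic sample and the compact one-line variant of the pattern are all assumptions made for illustration.

```python
import re
import timeit

# Compact, non-verbose variant of a bracket-group pattern (an assumption for
# this sketch): "(", then quoted strings or characters other than "'" and ")",
# then ")".
GROUP = re.compile(r"\((?:'(?:\\.|[^'\\])*'|[^')])*\)")


def split_regex(text):
    return GROUP.findall(text)


def seconds_per_call(func, text, number=100):
    """Average wall-clock seconds for one call of func(text)."""
    return timeit.timeit(lambda: func(text), number=number) / number


if __name__ == "__main__":
    # Synthetic input: many groups, each with a tricky quoted string.
    sample = ",".join(["('1','a),(c',NULL)"] * 10000)
    print("regex: %.6f s per call" % seconds_per_call(split_regex, sample))
```

Any other implementation with the same signature can be passed to `seconds_per_call` with the same sample to get comparable numbers; absolute timings depend on the machine and the data.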

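Do you think it will be faster than solutions 2 and 3?

So, I think it is more or less an FSM, and its speed on real data is comparable to solution 2.

The solution is O(n) per string, where n is the length of the string. So yes, it is as fast as it can get in big-O terms, since you need to scan the string at least once. Compared with solution #2, I think mine is shorter. Method-wise they are similar, and both are O(n). Any difference in running time is probably due to implementation details. By the way, can you share how you decided whether an algorithm is slow?

Interesting, I didn't know this library. For various reasons I can't use it, but I'll keep an eye on it. Thanks! +1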
Breakdown of the (*SKIP)(*FAIL) pattern:

'.*?(?<!\\)'     # look for a single quote up to a new single quote
                 # that MUST NOT be escaped (thus the neg. lookbehind)
(*SKIP)(*FAIL)|  # these parts shall fail
(?<=\)),(?=\()   # your initial pattern with a positive lookbehind/ahead