Split a SQL statement on function names but keep the delimiter in Python

python sql regex

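Suppose I have the following string, which contains SQL expressions extracted from a SELECT clause (in reality this is one large SQL statement containing hundreds of such expressions).

PS: for demonstration purposes I have added some indentation, but in reality the expressions are only separated by commas, meaning the original form contains no extra whitespace or newlines.

I have a solution, although it takes a bit more code. It does not need regex, only repeated splitting on the keywords: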

s = """
SUM(case when(A.money-B.money>1000
                and A.unixtime-B.unixtime<=890769
                and B.col10 = "A"
                and B.col11 = "12"
                and B.col12 = "V") then 10
      end) as finalCond0,
MAX(case when(A.money-B.money<0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "4321"
                and B.cond3 in ("E", "F", "G")) then A.col10
      end) as finalCond1,
SUM(case when(A.money-B.money>0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "1234"
                and B.cond3 in ("A", "B", "C")) then 2
      end) as finalCond2 
"""

# remove newlines and double spaces
s = s.replace('\n', ' ')
while '  ' in s:
    s = s.replace('  ', ' ')
s = s.strip()

# split on keywords, starting with the original string
current_parts = [s, ]
for kw in ['SUM', 'MAX', 'MIN']:
    new_parts = []
    for part in current_parts:
        for i, new_part in enumerate(part.split(kw)):
            if i > 0:
                # add keyword to the start of this substring
                new_part = '{}{}'.format(kw, new_part)

            new_part = new_part.strip()
            if len(new_part) > 0:
                new_parts.append(new_part.strip())

    current_parts = new_parts

print()
print('current_parts:')
for s in current_parts:
    print(s)
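
As a compact variation on the keyword loop above, the same "split but keep the keyword" idea can be sketched with a zero-width lookahead in `re.split`. This is a minimal sketch rather than a replacement for the answer: it assumes Python 3.7 or later (where `re.split` splits on empty matches), the helper name `split_keep_keyword` is hypothetical, and, unlike the loop above, it strips the trailing comma from each piece. Like the loop, it would also split on a nested `SUM(` inside another expression, so it is only suitable for flat lists such as the sample.

```python
import re


def split_keep_keyword(select_list, keywords=("SUM", "MAX", "MIN")):
    """Split before each `KEYWORD(` while keeping the keyword in the piece."""
    # (?=...) is a lookahead: it matches the empty string right before the
    # keyword, so nothing is consumed and the keyword stays with its piece.
    # \b avoids splitting inside longer names such as CHECKSUM(.
    pattern = r"(?=\b(?:%s)\()" % "|".join(map(re.escape, keywords))
    pieces = re.split(pattern, select_list)
    # drop empty pieces and the separating comma left at the end of each one
    return [p.strip().rstrip(",").strip() for p in pieces if p.strip()]


sample = "SUM(a) as x, MAX(b) as y, SUM(c) as z"
print(split_keep_keyword(sample))
```

Because the pattern requires a word boundary before the keyword and an opening parenthesis after it, column names that merely contain `SUM` or `MAX` are left alone, which plain `str.split('SUM')` would not guarantee.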

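You can't use a regular expression here, because SQL syntax does not form regular patterns that can be matched with the Python re engine. You have to actually parse the string into a token stream or syntax tree; your SUM(...) can contain a wide range of syntax, after all, including subselects.

Even though it is a somewhat involved process, this can be done with the sqlparse library. Re-using the walk_tokens function I defined in the other post I linked to: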
from collections import deque
from sqlparse.sql import TokenList

def walk_tokens(token):
    queue = deque([token])
    while queue:
        token = queue.popleft()
        if isinstance(token, TokenList):
            queue.extend(token)
        yield token

Extracting the last element from the SELECT identifier list is then:
import sqlparse
from sqlparse.sql import IdentifierList

tokens = sqlparse.parse(sql)[0]
for tok in walk_tokens(tokens):
    if isinstance(tok, IdentifierList):
        # iterate to leave the last assigned to `identifier`
        for identifier in tok.get_identifiers():
            pass
        break

print(identifier)

Demo: the full interactive session and its output are reproduced further down this page.

You can use something like this:
import re

# the SELECT list as a single line, as in the original data
sql_text = 'SUM(case when(A.money-B.money>1000 and A.unixtime-B.unixtime<=890769 and B.col10 = "A" and B.col11 = "12" and B.col12 = "V") then 10 end) as finalCond0, MAX(case when(A.money-B.money<0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "4321" and B.cond3 in ("E", "F", "G")) then A.col10 end) as finalCond1, SUM(case when(A.money-B.money>0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "1234" and B.cond3 in ("A", "B", "C")) then 2 end) as finalCond2'

# find every alias of the form "as <name>"; the raw string keeps \s a regex escape
aliases = re.finditer(r'as\s+[a-zA-Z0-9]+', sql_text)

commas = []
parts = []

# remember each comma that directly follows an alias
for match in aliases:
    end = match.end()
    if len(sql_text) > end and sql_text[end] == ',':
        commas.append(end)

# slice the string at the remembered comma positions
idx = 0
for comma in commas:
    parts.append(sql_text[idx:comma])
    idx = comma + 1
parts.append(sql_text[idx:])

print(parts)

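In parts you will have the final array with the pieces (I am not sure this implementation is the best way); the comma positions and the printed result are listed at the end of this page.

Comments:

Have you tried splitting on the comma (,)?

@Ralf That does not work in this case. For example, the last piece of sql.split(',') is: "C")) then 2 end) as finalCond2

Hm... and .split(',\n')?

@Ralf That does not work either. For demonstration purposes I provided some indentation, but in reality the expressions are only separated by commas, meaning no whitespace or newlines appear in the original form.

But I already told you, there are no newlines (\n) in the string.

The code I posted works whether or not the string contains newlines. I only make sure there are no newlines at the start; if the string does not contain any, that does not affect the result.

OK. Thanks for trying.

Wouldn't it be easier with a regular expression?

@Old School: No, because Python's regular expressions cannot be used to parse nested structures.

@Old School: SQL is not "regular"; you cannot predict the number of parentheses, or where commas are used, and so on.

Would this solution work even if I don't have a complete SQL statement? That is, they are only the identifiers of a SELECT statement, not the actual statement itself.

@Old School: I used your literal input, it is in the demo. There is no SELECT or FROM or JOIN in sight.

Is the output of this code close to the desired output I posted in the question?

Edited for you, check now!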
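
The comments above point out that a plain comma split fails because commas also occur inside parentheses, such as `in ("E", "F", "G")`. As an illustration of that point, here is a minimal standard-library sketch that only splits on commas outside any parentheses or quotes. It is not taken from any of the answers; the function name `split_top_level` is hypothetical, and it assumes balanced parentheses and no SQL comments or backslash escapes in the input. For anything more complex, a real parser such as sqlparse remains the safer route.

```python
def split_top_level(select_list):
    """Split on commas that sit outside parentheses and quoted literals."""
    parts, current = [], []
    depth = 0      # current parenthesis nesting level
    quote = None   # the quote character we are inside, if any
    for ch in select_list:
        if quote:
            # inside a quoted literal only the matching quote matters
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            # a top-level comma ends the current expression
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


sample = 'SUM(case when(B.c in ("E", "F")) then 1 end) as a, MAX(x) as b'
for piece in split_top_level(sample):
    print(piece)
```

Tracking the nesting depth is what a regular expression from Python's `re` module cannot do for arbitrarily nested input, which is why this sketch walks the string character by character instead.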
Demo session for the sqlparse approach:
>>> sql = '''\
...   SUM(case when(A.money-B.money>1000
...                 and A.unixtime-B.unixtime<=890769
...                 and B.col10 = "A"
...                 and B.col11 = "12"
...                 and B.col12 = "V") then 10
...       end) as finalCond0,
...   MAX(case when(A.money-B.money<0
...                 and A.unixtime-B.unixtime<=6786000
...                 and B.cond1 = "A"
...                 and B.cond2 = "4321"
...                 and B.cond3 in ("E", "F", "G")) then A.col10
...         end) as finalCond1,
...   SUM(case when(A.money-B.money>0
...                 and A.unixtime-B.unixtime<=6786000
...                 and B.cond1 = "A"
...                 and B.cond2 = "1234"
...                 and B.cond3 in ("A", "B", "C")) then 2
...       end) as finalCond2
... '''
>>> tokens = sqlparse.parse(sql)[0]
>>> for tok in walk_tokens(tokens):
...     if isinstance(tok, IdentifierList):
...         # iterate to leave the last assigned to `identifier`
...         for identifier in tok.get_identifiers():
...             pass
...         break
...
>>> print(identifier)
SUM(case when(A.money-B.money>0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "1234"
                and B.cond3 in ("A", "B", "C")) then 2
      end) as finalCond2

Output of the regex approach. The comma positions collected in commas:
[151, 322]

The resulting parts:
[
    'SUM(case when(A.money-B.money>1000 and A.unixtime-B.unixtime<=890769 and B.col10 = "A" and B.col11 = "12" and B.col12 = "V") then 10 end) as finalCond0',
    ' MAX(case when(A.money-B.money<0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "4321" and B.cond3 in ("E", "F", "G")) then A.col10 end) as finalCond1',
    ' SUM(case when(A.money-B.money>0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "1234" and B.cond3 in ("A", "B", "C")) then 2 end) as finalCond2'
]