Extract columns from a Hive struct data type represented as a string in Python


python-3.x regex validation parsing


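In my Python program, I need to write a function that takes a Hive data type as input and returns whether that data type is valid.

The primitive data types supported in Hive are: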

supported_data_types: Set = {'void', 'boolean', 'tinyint', 'smallint', 'int', 'bigint', 'float', 'double',
                                  'decimal', 'string', 'varchar', 'timestamp', 'date', 'binary'}

The complex data types supported in Hive are:

arrays: array<data_type>
maps: map<primitive_type, data_type>
structs: struct<col_name : data_type [comment col_comment], ...>
union: union<data_type, data_type, ...>

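I have managed to validate array and map data types at any nesting level. I noticed that, for arrays, any valid type has the syntax array<data_type>. So if I have a type array<data_type>, I recursively validate data_type. For a map, the key is always a primitive data type, so map data types can also be validated recursively. I use the following function to validate data types recursively: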

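def validate_datatype(_type: str) -> bool:
    _type = _type.strip()
    if _type in supported_data_types:
        return True
    # Array types have the syntax array<data_type>, where data_type is any valid data type
    if _type.startswith("array<" [...]

The task here is to find every comma and colon that has the same number of < as > to its left. Regular expressions do have some fancy features, such as the ability to store variables while searching, but I think this problem is beyond their scope. Regex is really only useful for simple pattern matching, such as recognizing URLs. It is easy to get carried away and want to apply regex everywhere, but that usually leads to unreadable code.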
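As a small illustration of why nesting matters here, the snippet below applies a plain str.split to the body of the struct from the question. It only demonstrates the problem and is not part of the original answer.

```python
# Body of the struct type from the question (the text between the outer < and >).
contents = "col_1: int, col_2 : struct<nested_col_1 : int, nested_col_2 : str>"

# A plain split cuts at every comma, including the one nested inside struct<...>.
naive_pieces = contents.split(",")

for piece in naive_pieces:
    print(repr(piece))

# The nested struct is torn in two, so three pieces come back
# even though the struct only has two top-level columns.
print(len(naive_pieces))
```

A splitter for this task therefore has to track how deep inside <...> it currently is and cut only at depth zero.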

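The simplest solution here is plain old iteration. To implement the extract_columns function, you need a split function that ignores separators nested inside <>; the first_index and top_level_split functions in the code listing further down do this.

[...], you pop from the stack. This can be done in a single scan of the string, but it is much more complicated than the code given here.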
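One way to read the stack remark is as bracket matching in a single pass over the string. The following is a minimal sketch under that assumption; match_angle_brackets is a hypothetical name and does not come from the original answer.

```python
from typing import Dict, Optional


def match_angle_brackets(string: str) -> Optional[Dict[int, int]]:
    """Map the index of each '<' to the index of its matching '>'.

    Returns None when the brackets are unbalanced.
    """
    stack = []  # indices of '<' characters that are still open
    pairs = {}
    for i, char in enumerate(string):
        if char == '<':
            stack.append(i)
        elif char == '>':
            if not stack:
                # A closing bracket with nothing open
                return None
            pairs[stack.pop()] = i
    if stack:
        # At least one '<' was never closed
        return None
    return pairs


print(match_angle_brackets("array<map<string, int>>"))
print(match_angle_brackets("array<int"))
```

With the pairs known up front, a validator could jump straight to the end of each nested type instead of rescanning it, at the cost of more bookkeeping.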

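Also, a general tip: try returning False instead of using assert. That way you do not have to worry about your code crashing when it is given an invalid string.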
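To make the difference concrete, here is a small sketch with two hypothetical checks (neither function is from the original post). The assert version raises on bad input, and assert statements are skipped entirely when Python runs with the -O flag; the second version reports the problem through its return value.

```python
def ends_with_bracket_assert(string: str) -> bool:
    # Raises AssertionError on bad input; the check vanishes under `python -O`.
    assert string.endswith('>'), "missing closing '>'"
    return True


def ends_with_bracket(string: str) -> bool:
    # Reports bad input through the return value instead of raising.
    return string.endswith('>')


print(ends_with_bracket("array<int>"))
print(ends_with_bracket("array<int"))

try:
    ends_with_bracket_assert("array<int")
except AssertionError as error:
    print("assert version raised:", error)
```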

Good luck with your project!

I see your point. The SO community is really great. Thank you for the effort you put into solving this.

from abc import abstractmethod
from typing import List, Tuple


@abstractmethod
def extract_columns(dt: str) -> List[Tuple[str, str]]:
    pass

dt = "struct<col_1: int, col_2 : struct<nested_col_1 : int, nested_col_2 : str>>"

extract_columns(dt)
>> [('col_1','int'), ('col_2', 'struct<nested_col_1 : int, nested_col_2 : str>')]

from typing import List


def first_index(target: str, string: str) -> int:
    # Finds the index of the first top-level occurrence of `target` in `string`
    depth = 0
    for i in range(len(string)):
        if string[i] == '>':
            depth -= 1
        if string[i] == target and depth == 0:
            return i
        if string[i] == '<':
            depth += 1
    # No matches were found
    return len(string)


def top_level_split(string: str, separator: str) -> List[str]:
    # Splits `string` on every occurrence of `separator` that is not nested in a "<>"
    # Empty pieces are kept (as str.split does), so callers can reject "a," or ""
    split = []
    while True:
        index = first_index(target=separator, string=string)
        split.append(string[:index])
        if index == len(string):
            # No top-level separator is left, so this was the last piece
            return split
        string = string[index + 1:]
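The listing above provides the splitting helpers but stops short of extract_columns itself. Below is a minimal sketch of how it could be written, not a definitive implementation. It assumes the input starts with exactly struct< and ends with >, ignores the optional column comments, and raises ValueError on malformed input (a design choice made here, not something the post specifies). To keep the block runnable on its own it uses a compact single-pass splitter, split_top_level, which is a hypothetical stand-in for top_level_split.

```python
from typing import List, Tuple


def split_top_level(string: str, separator: str) -> List[str]:
    """Split on separators that are not nested inside <...>."""
    pieces, depth, start = [], 0, 0
    for i, char in enumerate(string):
        if char == '<':
            depth += 1
        elif char == '>':
            depth -= 1
        elif char == separator and depth == 0:
            pieces.append(string[start:i])
            start = i + 1
    pieces.append(string[start:])
    return pieces


def extract_columns(dt: str) -> List[Tuple[str, str]]:
    """Return (column name, column type) pairs for a struct type string."""
    dt = dt.strip()
    if not (dt.startswith('struct<') and dt.endswith('>')):
        raise ValueError(f"not a struct type: {dt!r}")
    contents = dt[len('struct<'):-1]
    columns = []
    for column in split_top_level(contents, ','):
        # The first colon separates the name from the type;
        # any later colons belong to nested structs.
        name, sep, col_type = column.partition(':')
        if not sep:
            raise ValueError(f"column without a type: {column!r}")
        columns.append((name.strip(), col_type.strip()))
    return columns


dt = "struct<col_1: int, col_2 : struct<nested_col_1 : int, nested_col_2 : str>>"
print(extract_columns(dt))
```

Calling extract_columns again on the type of col_2 yields the nested columns, so deeper levels can be unpacked one level at a time.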

def is_valid_datatype(string: str) -> bool:
    
    if is_valid_primitive(string):
        return True
    
    # Find the opening angle bracket
    left = first_index(target='<', string=string)
    # Find the matching closing angle bracket
    right = first_index(target='>', string=string)
    if left > right or right == len(string):
        return False
    # Make sure there isn't anything to the right of `right`
    if string[right + 1 : ].strip() != '':
        return False
    
    # The name of the data type, e.g. "array"
    type_name = string[ : left].strip()
    # The substring between the angle brackets
    contents = string[left + 1 : right]
    
    if type_name == 'array':
        return is_valid_datatype(contents)
    if type_name == 'map':
        # We don't need `top_level_split` here because the first type is always primitive
        split = contents.split(',', 1)
        return len(split) == 2 and is_valid_primitive(split[0]) and is_valid_datatype(split[1])
    if type_name == 'struct':
        # Get each column by splitting on top-level commas
        for column in top_level_split(string=contents, separator=','):
            # We don't need `top_level_split` here because the first type is a column name
            split = column.split(':', 1)
            if not (len(split) == 2 and is_valid_colname(split[0]) and is_valid_datatype(split[1])):
                return False
        # All columns were valid!
        return True
    if type_name == 'union':
        for entry in top_level_split(string=contents, separator=','):
            if not is_valid_datatype(entry):
                return False
        # All entries were valid!
        return True

    # The type name is not recognized
    return False
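is_valid_datatype calls two helpers, is_valid_primitive and is_valid_colname, that are not shown. The sketch below is one possible way to fill them in, under stated assumptions: primitive names must match the set from the question exactly (lowercase, no parameters such as a length or precision), and a column name is taken to be a letter or underscore followed by letters, digits, or underscores. Both rules are guesses for illustration and may be narrower than what Hive accepts.

```python
import re

# Primitive type names listed in the question.
SUPPORTED_PRIMITIVES = {
    'void', 'boolean', 'tinyint', 'smallint', 'int', 'bigint', 'float',
    'double', 'decimal', 'string', 'varchar', 'timestamp', 'date', 'binary',
}

# Assumption: identifier-like column names only.
_COLNAME = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def is_valid_primitive(string: str) -> bool:
    # Surrounding whitespace is ignored, e.g. the " int" left over after a split.
    return string.strip() in SUPPORTED_PRIMITIVES


def is_valid_colname(string: str) -> bool:
    # fullmatch requires the whole (stripped) name to fit the pattern.
    return _COLNAME.fullmatch(string.strip()) is not None


print(is_valid_primitive(' int '))
print(is_valid_primitive('str'))
print(is_valid_colname(' col_2 '))
```

Note that 'str', which appears in the example struct from the question, is not in the primitive set, so under these helpers that example would be reported as invalid.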