How to extract column names from a multi-line string (or list) in Python?

python regex string parsing

I have the following example string:

column_names = """
    ================================================================================
                                             total                   total     final
    store.                        toys       output   person 1/     usage 5/   stock
    ================================================================================
"""

I can break it down line by line as follows:

column_lines = [
'    ================================================================================',
'                                             total                   total     final',
'    store.                        toys       output   person 1/     usage 5/   stock',
'    ================================================================================',
]
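One way to produce such a list from the raw string is sketched below. It assumes a shortened, hypothetical sample rather than the full table, and it simply drops empty lines while leaving the `=` separator lines in place.

```python
# Shortened, hypothetical stand-in for the column_names string above
column_names = """
    ==========
         total
    store.
    ==========
"""

# splitlines() breaks the text at newlines; the filter drops the
# empty first line produced by the opening triple quote
column_lines = [line for line in column_names.splitlines() if line.strip()]

for line in column_lines:
    print(repr(line))
```

Leading whitespace is kept on purpose, since the horizontal position of each word is what later identifies its column.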

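Without knowing the text in the string in advance, I would like to find a way to obtain the following list:

['store', 'toys', 'total output', 'person', 'total usage', 'final stock']

I am struggling to find a way to solve this.

What are the different approaches to this problem? How can I extract the strings from the multi-line text without knowing what the column names are?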

Basic working solution

Here is a working solution. We need to specify the two lines to "group".

import re

def find_group(l1, l2):
    """Pair each word in l2 with the word in l1 that sits above it, if any."""

    def intersect(x1, x2):
        # True when the two (start, end) column ranges overlap or touch
        return x1[0] <= x2[1] and x1[1] >= x2[0]

    pat = r"[a-zA-Z]+"
    matches1 = [(match.start(0), match.end(0)) for match in re.finditer(pat, l1)]
    matches2 = [(match.start(0), match.end(0)) for match in re.finditer(pat, l2)]

    ret = []
    for g2 in matches2:
        add_g2 = True
        for g1 in matches1:
            if intersect(g1, g2):
                # A word in l1 lies above this word: join them into one name
                ret.append(l1[g1[0]:g1[1]] + " " + l2[g2[0]:g2[1]])
                add_g2 = False
                break
        if add_g2:
            # Nothing above it: the l2 word is the whole column name
            ret.append(l2[g2[0]:g2[1]])

    return ret
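The idea behind this solution is that two words belong to the same column when their character ranges overlap horizontally. The sketch below shows that test in isolation; `word_spans`, `spans_overlap` and the two sample lines are illustrative names and data, not part of the answer.

```python
import re

def word_spans(line, pattern=r"[a-zA-Z]+"):
    """Return the (start, end) index pair of every word found in line."""
    return [m.span() for m in re.finditer(pattern, line)]

def spans_overlap(a, b):
    """True when two (start, end) ranges overlap or touch."""
    return a[0] <= b[1] and a[1] >= b[0]

# Hypothetical two-line header, built explicitly so the columns are unambiguous
upper = " " * 10 + "total" + " " * 6 + "final"
lower = "store." + " " * 4 + "output" + " " * 5 + "stock"

for low in word_spans(lower):
    above = [up for up in word_spans(upper) if spans_overlap(up, low)]
    word = lower[low[0]:low[1]]
    if above:
        up = above[0]
        print(upper[up[0]:up[1]], word)   # a word sits above this one
    else:
        print(word)                       # nothing above it
```

Here `store` has nothing above it, while `output` and `stock` are paired with `total` and `final` respectively.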

General solution

Here is a solution that handles any number of lines.

import re

def find_group(lines):

    if isinstance(lines, str):
        lines = lines.split("\n")

    def intersect(x1, x2):
        """Checks if two couples of x-coordinates intersect."""
        return x1[0] <= x2[1] and x1[1] >= x2[0]

    pat = r"[a-zA-Z]+"
    # Coordinates of all parts matching the pattern, per line
    matches = [[(match.start(0), match.end(0)) for match in re.finditer(pat, line)]
               for line in lines]

    # Starts from the matches of line 0 (copied, so matches[0] is not mutated)
    groups = [list(m) for m in matches[0]]
    for i in range(1, len(lines)):
        for g2 in matches[i]:
            add_g2 = True
            for i_g1, g1 in enumerate(groups):
                if intersect(g1, g2):
                    # Merge both lines intersection into the variable groups
                    groups[i_g1] = [min(g1[0], g2[0]), max(g1[1], g2[1])]
                    add_g2 = False
                    break
            if add_g2:
                # If alone in the x-coord, adds the match as a new group
                groups.append([g2[0], g2[1]])
            # "groups" becomes the merge of the first i lines results.

    # Sorts the groups by their first coordinate.
    # Then joins all matches located between each group's coordinates
    listed_groups = [[" ".join(re.findall(pat, line[group[0]: group[1]]))
                      for line in lines]
                     for group in sorted(groups)]

    # Replaces all unnecessary whitespaces and format groups as strings
    return [re.sub(r"\s+", " ", " ".join(g).strip()) for g in listed_groups]

Let me know if you need more explanation.
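Since the question asks about different approaches, here is a sketch of another one, which is not the method used above: split the header wherever a run of character columns is blank in every line. It assumes spaces rather than tabs, at least two blank columns between neighbouring table columns, and that the `=` separator lines have been removed first. `split_on_blank_columns`, `min_gap` and the sample lines are hypothetical.

```python
import re

def split_on_blank_columns(lines, min_gap=2):
    """Split header lines at runs of >= min_gap columns blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # occupied[i] is True when some line has a non-space character in column i
    occupied = [any(line[i] != " " for line in padded) for i in range(width)]

    names, start, gap = [], None, 0
    # Trailing False values act as a sentinel that flushes the last group
    for i, used in enumerate(occupied + [False] * min_gap):
        if used:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                end = i - gap + 1  # first blank column of the gap
                words = [w for line in padded
                         for w in re.findall(r"[a-zA-Z]+", line[start:end])]
                if words:
                    names.append(" ".join(words))
                start, gap = None, 0
    return names

# Hypothetical header, built explicitly so the alignment is unambiguous
upper = " " * 10 + "total" + " " * 16 + "final"
lower = "store." + " " * 4 + "output" + " " * 3 + "person 1/" + " " * 3 + "stock"
lines = ["=" * 36, upper, lower, "=" * 36]

# Drop empty lines and lines made only of "=" before splitting
header = [l for l in lines if l.strip() and set(l.strip()) != {"="}]
print(split_on_blank_columns(header))
# ['store', 'total output', 'person', 'final stock']
```

The trade-off is that this variant depends on a minimum gap width, whereas the overlap-based solutions above depend on the words of one column actually overlapping.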


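Here is another solution. It is certainly longer, but I think it is more intuitive. It also allows you to have multiple lines in the input, and collates them together as you would expect: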

import re
column_names = """
================================================================================
total total final
store. toys output person 1/ usage 5/ stock
name here toys person 1/ here 5/ stock
================================================================================
"""
# Get rid of the ==== separator lines
column_names = re.sub(r'\n\s*={2,}', '', column_names)
# Remove the newline from the starting position
column_names = re.sub(r'(?...
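The separator-stripping substitution from this answer can be tried on its own. The sketch below uses a shortened, hypothetical table; only the pattern `\n\s*={2,}` is taken from the answer.

```python
import re

# Shortened, hypothetical table with indented "=" separator lines
column_names = (
    "\n"
    "    ====================\n"
    "              total\n"
    "    store.    output\n"
    "    ====================\n"
)

# Remove each newline that is followed by optional whitespace and a run of "="
stripped = re.sub(r"\n\s*={2,}", "", column_names)

# What remains are the header lines, with their leading spaces intact
header_lines = [line for line in stripped.splitlines() if line.strip()]
print(header_lines)
```

Because `\s*` also matches newlines, a blank line directly before a separator would be removed together with it, which is harmless for this purpose.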
In the first and second cases, the words total and output are not properly aligned. What will the input look like? For example, will they be left-aligned, right-aligned, or both? Also, I doubt there is a way to ignore 1/ and 5/ without directly replacing them with the same number of space characters.

Good point in the first comment. I have corrected that now and given both aligned and non-aligned examples, since both occur. As for 1/ and 5/, I believe a regex can take care of them.

As much as I love using regex to solve non-regex problems, I don't believe there is any easy way to do the word grouping with regex. Grouping them with plain Python code is much simpler, however. I suggest changing the tags and the question so they are not limited to regex; if for some reason you need regex only, you may have to wait a while for an answer. The problem would probably be easier to solve that way.
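The idea raised in the comments, replacing markers such as 1/ and 5/ with the same number of spaces, could look like the sketch below. The helper name `blank_out_markers` and the pattern `\d+/` are assumptions about what such markers look like, not something the question specifies.

```python
import re

def blank_out_markers(line, pattern=r"\d+/"):
    """Replace each footnote marker with spaces of equal length."""
    # Using a function as the replacement keeps every other character
    # at its original column, so the alignment is preserved
    return re.sub(pattern, lambda m: " " * len(m.group()), line)

line = "person 1/     usage 5/   stock"
cleaned = blank_out_markers(line)
print(repr(cleaned))
```

Keeping the line length unchanged matters here because the grouping relies on character positions.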