Parsing colon-delimited data in Python

Tags: python, regex, multilinestring

I have the following block of text:

string = """
    apples: 20
    oranges: 30
    ripe: yes
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
                   lives close
            farmer bill
                   lives far
    selling: yes
    veggies:
            carrots
            potatoes
    """

I am trying to find a good regex that lets me parse out the keys and values. I can grab the single-line key/value pairs with something like:

'(.+?):\s(.+?)\n'

However, the problem comes when I hit farmers or veggies.

Using the re flags, I need to do something like:

re.findall('(.+?):\s(.+?)\n', string, re.S)

However, I am having a hard time grabbing all of the values associated with farmers.

When a value spans multiple lines, there is a newline after each value and a tab, or a series of tabs, before it.

The goal is to end up with something like:

{ 'apples': 20, 'farmers': ['elmer fudd', 'farmer ted'] }

and so on.

Thanks in advance for your help.

You may notice that this text is very close to, if not actually, valid YAML.

Here is a really dumb way to do it:

import collections


string = """
    apples: 20
    oranges: 30
    ripe: yes
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
                   lives close
            farmer bill
                   lives far
    selling: yes
    veggies:
            carrots
            potatoes
    """


def funky_parse(inval):
    lines = inval.split("\n")
    items = collections.defaultdict(list)
    at_val = False
    key = ''
    val = ''
    last_indent = 0
    for j, line in enumerate(lines):
        indent = len(line) - len(line.lstrip())
        if j != 0 and at_val and indent > last_indent > 4:
            continue
        if j != 0 and ":" in line:
            if val:
                items[key].append(val.strip())
            at_val = False
            key = ''
        line = line.lstrip()
        for i, c in enumerate(line, 1):
            if at_val:
                val += c
            else:
                key += c
            if c == ':':
                at_val = True
            if i == len(line) and at_val and val:
                items[key].append(val.strip())
                val = ''
        last_indent = indent

    return items

print(dict(funky_parse(string)))

Output (key order may differ):

{'farmers:': ['elmer fudd', 'farmer ted', 'farmer bill'], 'apples:': ['20'], 'veggies:': ['carrots', 'potatoes'], 'ripe:': ['yes'], 'oranges:': ['30'], 'selling:': ['yes']}


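Here is a really dumb parser that takes your (apparent) indentation rules into account:

def parse(s):
    d = {}
    lastkey = lastval = lastindent = None
    for fullline in s:
        line = fullline.strip()
        if not line:
            pass
        elif ':' not in line:
            # Item line: only meaningful once a bare "key:" line has been seen.
            if lastkey is None:
                continue
            indent = len(fullline) - len(fullline.lstrip())
            if lastindent is None:
                lastindent = indent
            # Keep items at the first item's indent; skip deeper detail lines.
            if lastindent == indent:
                lastval.append(line)
        else:
            # Any line with a colon closes the list being collected.
            if lastkey:
                d[lastkey] = lastval
                lastkey = None
            if line.endswith(':'):
                # Bare "key:" line: start collecting a list for this key.
                lastkey, lastval, lastindent = line[:-1], [], None
            else:
                key, _, value = line.partition(':')
                d[key] = value.strip()
    if lastkey:
        d[lastkey] = lastval
    return d

from pprint import pprint
pprint(parse(string.splitlines()))

The output is:

{'apples': '20',
 'farmers': ['elmer fudd', 'farmer ted', 'farmer bill'],
 'oranges': '30',
 'ripe': 'yes',
 'selling': 'yes',
 'veggies': ['carrots', 'potatoes']}

I think this is already complicated enough that it would look cleaner as an explicit state machine, but I wanted to write it in terms any novice could understand.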
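For comparison, the explicit state machine mentioned above might look roughly like the sketch below. It assumes the same apparent rules as the parser in this answer (a line containing a colon starts a new key, items sit at the indent of the first item under a key, deeper lines are detail to skip) and that values never contain a colon. The names `State`, `parse_state_machine`, `sample` and `parsed` are made up for illustration.

```python
from enum import Enum, auto


class State(Enum):
    TOP = auto()      # expecting "key: value" or bare "key:" lines
    IN_LIST = auto()  # collecting indented items for the current key


def parse_state_machine(text):
    """Parse 'key: value' lines and 'key:' headers followed by indented items."""
    result = {}
    state = State.TOP
    items = None        # list being filled while in IN_LIST
    item_indent = None  # indent of the first item under the current key

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ':' in line:
            # Assumption: any line with a colon starts a new key.
            key, _, value = line.partition(':')
            value = value.strip()
            if value:
                result[key] = value
                state = State.TOP
            else:
                items = result[key] = []
                item_indent = None
                state = State.IN_LIST
        elif state is State.IN_LIST:
            indent = len(raw) - len(raw.lstrip())
            if item_indent is None:
                item_indent = indent
            # Deeper lines such as "lives in tv" are skipped.
            if indent == item_indent:
                items.append(line)
    return result


sample = """
    apples: 20
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
                   lives close
    selling: yes
    veggies:
            carrots
            potatoes
    """

parsed = parse_state_machine(sample)
print(parsed)
```

The two states make the one real decision visible: a colon-free line only means something while a list is being collected, and is ignored otherwise.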


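This is close, but I believe farmers would end up as one long string rather than a proper list. If I could grab the value, I could split it on newlines and build the list myself. I am still trying to figure out how best to grab the values.

Does the "lives in tv" part mean anything? You didn't mention it in your desired output. How about this approach: split the text on newlines and store the result in a list, then go through it line by line and split each line on ':'. If the second part is not empty, add the two parts to the dictionary as key and value, and pop that line from the list. After that you are left with only the bare keys (those ending in ':'), and everything else goes into the list for the key above it. Run through the trimmed list and add what remains to the dictionary.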
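The two-pass idea in the comment above could be sketched as follows. This is one reading of the suggestion, not the commenter's code: instead of popping lines while iterating, it gathers the leftover lines into a second list, and the names `two_pass_parse`, `remaining`, `sample` and `parsed` are invented for the example.

```python
def two_pass_parse(text):
    """Pass 1 takes 'key: value' lines; pass 2 groups what is left under bare keys."""
    result = {}
    remaining = []  # lines that were not consumed by pass 1

    # Pass 1: split each line on ':' and keep pairs whose second part is non-empty.
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(':')
        if sep and value.strip():
            result[key.strip()] = value.strip()
        else:
            remaining.append(line)

    # Pass 2: a bare "key:" opens a list; every following line is appended to it.
    current = None
    for line in remaining:
        if line.endswith(':'):
            current = result.setdefault(line[:-1], [])
        elif current is not None:
            current.append(line)
    return result


sample = """
    apples: 20
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
                   lives close
    selling: yes
    veggies:
            carrots
            potatoes
    """

parsed = two_pass_parse(sample)
print(parsed['farmers'])
```

Because this version ignores indentation, detail lines such as "lives in tv" land in the farmers list as well, which is why the question about whether that part means anything matters.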

Why is "lives in tv" not in the list? Or "farmer bill", for that matter?

Thanks, this is a very clean solution. I originally tried to solve this with a regex, but a regex is probably not worth it here and would only add complexity.

@user2152283: Whenever I can't work out how to do something with a regex (even when I'm sure that what I'm trying to parse is a regular language), I take a step back and try to write it another way. Sometimes that leads me to understand the regexp subconsciously; sometimes it means I end up with a readable parser that isn't regexp-based; and sometimes I prove to myself that the language isn't regular, or is even context-sensitive, and that I need something more sophisticated... but either way, it's a win.
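For what it is worth, the sample in the question does seem to be within reach of a regex, provided the pattern is anchored per line with re.MULTILINE rather than letting "." cross newlines with re.S. The sketch below is only one possible take, under the assumptions that keys are single words, values contain no colon, and list items sit at the indent of the first item; `ENTRY`, `regex_parse`, `sample` and `parsed` are hypothetical names.

```python
import re

# One entry: a "key:" with an optional inline value, followed by any number of
# indented lines that contain no colon (the candidate list items).
ENTRY = re.compile(
    r'^[ \t]*(\w+):[ \t]*(.*)\n'    # key and the rest of its line
    r'((?:[ \t]+[^:\n]*\n)*)',      # following colon-free, indented lines
    re.MULTILINE,
)


def regex_parse(text):
    result = {}
    for key, value, block in ENTRY.findall(text):
        if value.strip():
            result[key] = value.strip()
            continue
        lines = [ln for ln in block.splitlines() if ln.strip()]
        indents = [len(ln) - len(ln.lstrip()) for ln in lines]
        # Keep only lines at the first item's indent; deeper lines are detail.
        result[key] = [ln.strip() for ln, ind in zip(lines, indents)
                       if ind == indents[0]]
    return result


sample = """
    apples: 20
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
                   lives close
    selling: yes
    veggies:
            carrots
            potatoes
    """

parsed = regex_parse(sample)
print(parsed)
```

Even here the regex only finds the blocks; sorting items from detail lines still takes ordinary code, which fits the point made in the comment above.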