Fixed-width text file to Python dictionary


I am trying to import into Python a text file like the one shown below:

+ CATEGORY_1 first_part of long attribute <NAME_a>
|     ...second part of long attribute
|    + CATEGORY_2: a sequence of attributes that extend over 
|    |     ... possibly many <NAME_b>
|    |     ... lines
|    |    + SOURCE_1 => source_code 
|    + CATEGORY_2: another sequence of attributes that extend over <NAME_c>
|    |     ... possibly many lines
|    |    + CATEGORY_1: yet another sequence of <NAME_d> attributes that extend over
|    |    |     ...many lines 
|    |    |    + CATEGORY_2: I really think <NAME_e> that
|    |    |    |     ... you got the point 
|    |    |    |     ... now
|    |    |    |    + SOURCE_1 => source_code 
|    + SOURCE_2 => path_to_file 
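As I see it, the main idea is to count the number of tabs before the start of each line, since that count determines the hierarchy. I tried looking at pandas' read_fwf and numpy's loadfromtxt, but without success.
Can you point me to a relevant module or strategy for solving this problem?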

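Not a complete answer, but you could use a stack-based approach.

Each time you enter a category, push its key onto the stack. Then read the next line, check its indentation level, and store the data where it belongs. If the level is the same as or shallower than the previous one, pop items off the stack until you are back at the parent. After that, you only need a basic regular expression to extract the items.

Some Python/pseudocode to give you the idea: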

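import re

def build_tree(lines):
    items = {}             # nested result: {node text: {child text: {...}}}
    levels = [(0, items)]  # stack of (level, dict); level 0 is the root
    for line in lines:
        # The sample marks depth with "|" rather than tabs, so count those.
        m = re.match(r"^([|\s]*)\+\s*(.*)$", line)
        if m is None:
            continue       # "..." continuation line: skipped in this sketch
        current_level = m.group(1).count("|") + 1
        # Same or shallower level than the top of the stack: pop back to the parent.
        while levels[-1][0] >= current_level:
            levels.pop()
        node = {}
        levels[-1][1][m.group(2).strip()] = node
        levels.append((current_level, node))
    return items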
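
The loop above does not deal with the `...` continuation lines in the sample. One possible preprocessing step, sketched here under assumptions, is to fold each continuation into the node it follows before any stack logic runs. The helper name `logical_lines` and both patterns are hypothetical, and the sketch assumes that every node starts with `+` and that depth is the number of `|` characters before it, plus one.

```python
import re

# Prefix made of "|" and spaces, then "+ text": a new node.
NODE = re.compile(r"^(?P<prefix>[|\s]*)\+\s*(?P<text>.*)$")
# Prefix made of "|" and spaces, then "... text": continuation of the previous node.
CONT = re.compile(r"^[|\s]*\.\.\.\s*(?P<text>.*)$")

def logical_lines(lines):
    """Yield one (level, text) pair per node, with continuation lines folded in."""
    current = None  # [level, text] of the node being assembled
    for raw in lines:
        line = raw.rstrip()
        node = NODE.match(line)
        if node:
            if current:
                yield tuple(current)
            current = [node.group("prefix").count("|") + 1, node.group("text")]
            continue
        cont = CONT.match(line)
        if cont and current:
            current[1] += " " + cont.group("text")
    if current:
        yield tuple(current)

sample = """\
+ CATEGORY_1 first_part of long attribute <NAME_a>
|     ...second part of long attribute
|    + CATEGORY_2: a sequence of attributes that extend over
|    |     ... possibly many <NAME_b>
|    |     ... lines
|    |    + SOURCE_1 => source_code
|    + SOURCE_2 => path_to_file
"""

for level, text in logical_lines(sample.splitlines()):
    print(level, text)
```

Each yielded pair can then be fed to the push/pop logic, so the stack only ever sees complete nodes.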
Here is one strategy:

For each line, use a regular expression to parse the line and extract the data.

Here is a draft:

import re

line = "|    + CATEGORY_2: another sequence of attributes that extend over <NAME_c>"

level = line.count("|") + 1
mo = re.match(r".*\+\s+(?P<category>[^:]+):.*<(?P<name>[^>]+)>", line)
category = mo.group("category")
name = mo.group("name")

print("level: {0}".format(level))
print("category: {0}".format(category))
print("name: {0}".format(name))

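Any hints on how to tackle this would be greatly appreciated. I am not only looking for an "out of the box" solution.

Strategy: since your data is stored flat (it is a text file), you need to write your own parser that works out the level of each line, identifies the names, and so on. To build the dictionary structure, you need a stack.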
The draft prints:

level: 2
category: CATEGORY_2
name: NAME_c
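
Putting the two suggestions together (a regular expression per line plus a stack), a minimal sketch of a complete parser might look like the following. It is an illustration under assumptions, not a definitive implementation: the names `parse_report` and `_annotate`, the patterns, and the output layout (`label`, `name`, `text`, `children`) are all hypothetical. It assumes that depth is the number of `|` characters before the `+`, plus one, and that a `...` line always continues the most recent node.

```python
import re

NODE = re.compile(r"^(?P<prefix>[|\s]*)\+\s*(?P<text>.*)$")  # "+ ..." starts a node
CONT = re.compile(r"^[|\s]*\.\.\.\s*(?P<text>.*)$")          # "... text" continues one
NAME = re.compile(r"<(?P<name>[^>]+)>")                       # "<NAME_x>" anywhere in the text

def parse_report(lines):
    """Return the top-level nodes as nested dicts (label, name, text, children)."""
    root = {"children": []}
    stack = [(0, root)]  # (level, node); level 0 is a synthetic root
    for raw in lines:
        line = raw.rstrip()
        m = NODE.match(line)
        if m is None:
            cont = CONT.match(line)
            if cont and len(stack) > 1:
                # Assumption: a continuation belongs to the most recent node.
                stack[-1][1]["text"] += " " + cont.group("text")
            continue
        level = m.group("prefix").count("|") + 1
        while stack[-1][0] >= level:  # same or shallower level: close open nodes
            stack.pop()
        node = {"text": m.group("text"), "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    _annotate(root["children"])
    return root["children"]

def _annotate(nodes):
    """Fill in label and name once each node's text is complete."""
    for node in nodes:
        parts = node["text"].split()
        node["label"] = parts[0].rstrip(":") if parts else ""
        found = NAME.search(node["text"])
        node["name"] = found.group("name") if found else None
        _annotate(node["children"])

sample = """\
+ CATEGORY_1 first_part of long attribute <NAME_a>
|     ...second part of long attribute
|    + CATEGORY_2: a sequence of attributes that extend over
|    |     ... possibly many <NAME_b>
|    |     ... lines
|    |    + SOURCE_1 => source_code
|    + CATEGORY_2: another sequence of attributes that extend over <NAME_c>
|    |     ... possibly many lines
|    |    + CATEGORY_1: yet another sequence of <NAME_d> attributes that extend over
|    |    |     ...many lines
|    |    |    + CATEGORY_2: I really think <NAME_e> that
|    |    |    |     ... you got the point
|    |    |    |     ... now
|    |    |    |    + SOURCE_1 => source_code
|    + SOURCE_2 => path_to_file
"""

tree = parse_report(sample.splitlines())
top = tree[0]
print(top["label"], top["name"])
print([(child["label"], child["name"]) for child in top["children"]])
```

Children are kept in a list rather than keyed by label because sibling labels repeat in the sample (two `CATEGORY_2` entries under the same parent), and a plain dict would silently overwrite one of them.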