Splitting blocks of text with regular expressions - Python

I get the following output from the Stanford parser:

nicaragua president ends visit to finland .

nn(ends-3, nicaragua-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(finland-6, to-5)
xcomp(visit-4, finland-6)

guatemala president ends visit to tropos .

nn(ends-3, guatemala-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(tropos-6, to-5)
xcomp(visit-4, tropos-6)

[...]

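I need to segment this output so that, for each sentence, I get a tuple containing the sentence and the list of all its dependencies, like

(sentence, [list of dependencies])

Can anyone suggest a way to do this in Python? Thanks!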

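You could do it like this, although it may be overkill for the structure you are parsing. If you also need to parse the dependencies themselves, it should be relatively easy to extend. I haven't run this or even checked the syntax, so don't kill me if it doesn't work right away.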

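READ_SENT = 0  # expecting a sentence line
PRE_DEPS = 1   # expecting the blank line that follows the sentence
DEPS = 2       # reading dependency lines

def parse_output(text):
    """Return a list of (sentence, [dependency lines]) tuples."""
    state = READ_SENT
    results = []
    sent = None
    deps = []
    for line in text.splitlines():
        line = line.strip()
        if state == READ_SENT:
            if line:  # skip any extra blank lines between blocks
                sent = line
                state = PRE_DEPS
        elif state == PRE_DEPS:
            if line:
                raise ValueError('invalid format')
            state = DEPS
        elif state == DEPS:
            if line:
                deps.append(line)
            else:
                # a blank line ends the dependency block
                results.append((sent, deps))
                sent = None
                deps = []
                state = READ_SENT
    if sent is not None:
        # flush the last block if the input has no trailing blank line
        results.append((sent, deps))
    return results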
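
Since the question asks about regular expressions, here is a rough sketch of a regex-based alternative, which also shows one way the dependency lines themselves might be parsed. It assumes the layout shown in the question (blocks separated by blank lines, dependencies written as `relation(word-index, word-index)`). The names `DEP_RE`, `parse_dependency` and `split_blocks` are made up for this example and are not part of any Stanford parser API.

```python
import re

# Assumed line shape: "nsubj(visit-4, ends-3)".
DEP_RE = re.compile(r'^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$')

def parse_dependency(line):
    """Split one dependency line into (relation, (head, index), (dependent, index))."""
    match = DEP_RE.match(line.strip())
    if match is None:
        raise ValueError('not a dependency line: %r' % line)
    rel, head, head_idx, dep, dep_idx = match.groups()
    return rel, (head, int(head_idx)), (dep, int(dep_idx))

def split_blocks(text):
    """Return (sentence, [dependency lines]) tuples using regex splitting."""
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r'\n\s*\n', text.strip()) if p.strip()]
    results = []
    for para in paragraphs:
        lines = [l.strip() for l in para.splitlines()]
        if all(DEP_RE.match(l) for l in lines):
            # A dependency paragraph belongs to the most recent sentence.
            if not results:
                raise ValueError('dependencies before any sentence')
            results[-1][1].extend(lines)
        else:
            # Anything else is treated as a sentence.
            results.append((para.strip(), []))
    return results
```

Calling `split_blocks(output)` on the parser output gives the `(sentence, [dependencies])` tuples, and `parse_dependency` can then be mapped over each dependency list if structured values are needed. The pattern is deliberately simple and would need adjusting for dependency lines that do not fit the assumed shape.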