Separating blocks of text with regular expressions - Python
I get the following output from the Stanford parser:
nicaragua president ends visit to finland .
nn(ends-3, nicaragua-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(finland-6, to-5)
xcomp(visit-4, finland-6)
guatemala president ends visit to tropos .
nn(ends-3, guatemala-1)
nn(ends-3, president-2)
nsubj(visit-4, ends-3)
xsubj(finland-6, ends-3)
root(ROOT-0, visit-4)
aux(tropos-6, to-5)
xcomp(visit-4, tropos-6)
[...]
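I have to segment this output so that I get tuples containing the sentence and the list of all its dependencies, like
(sentence, [list of dependencies])
for every sentence. Can anyone suggest a way to do this in Python? Thanks!

You could do something like this, although it may be overkill for the structure you are parsing. If you also need to parse the dependencies themselves, extending it should be relatively easy. I have not run this or even checked the syntax, so don't kill me if it doesn't work right away.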
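# Parser states
READ_SENT = 0   # expecting a sentence line
PRE_DEPS = 1    # expecting the blank line between a sentence and its dependencies
DEPS = 2        # reading dependency lines until a blank line
POST_DEPS = 3   # expecting a second blank line that closes the block


def parse_output(text):
    """Split parser output into (sentence, [dependencies]) tuples.

    Assumes each block is: sentence, blank line, dependency lines,
    then two blank lines.
    """
    state = READ_SENT
    results = []
    sent = None
    deps = []
    for line in text.splitlines():
        if state == READ_SENT:
            sent = line
            state = PRE_DEPS
        elif state == PRE_DEPS:
            if line:
                raise ValueError('invalid format')
            state = DEPS
        elif state == DEPS:
            if line:
                deps.append(line)
            else:
                state = POST_DEPS
        elif state == POST_DEPS:
            if line:
                raise ValueError('invalid format')
            results.append((sent, deps))
            sent = None
            deps = []
            state = READ_SENT
    # Flush the last block if the input ends without the closing blank lines.
    if sent is not None and state in (DEPS, POST_DEPS):
        results.append((sent, deps))
    return results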
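
Since the question asks about regular expressions, here is a minimal sketch of a different approach from the state machine above. It classifies each line by its shape instead of relying on blank-line separators. The names `DEP_RE`, `group_by_pattern` and `parse_dependency` are hypothetical, and the pattern assumes every dependency line looks like `relation(head-N, dependent-M)`, as in the sample output; any other non-blank line is treated as a sentence.

```python
import re

# Assumed shape of a dependency line: relation(head-index, dependent-index).
# This pattern is inferred from the sample output, not from a parser spec.
DEP_RE = re.compile(r"^(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)$")


def group_by_pattern(text):
    """Return a list of (sentence, [dependency lines]) tuples."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank separators carry no information here
        if DEP_RE.match(line):
            if not results:
                raise ValueError("dependency line before any sentence")
            results[-1][1].append(line)  # attach to the most recent sentence
        else:
            results.append((line, []))  # any other line starts a new block
    return results


def parse_dependency(line):
    """Split one dependency line into (relation, (head, i), (dependent, j))."""
    m = DEP_RE.match(line.strip())
    if m is None:
        raise ValueError("not a dependency line: %r" % line)
    rel, head, head_idx, dep, dep_idx = m.groups()
    return rel, (head, int(head_idx)), (dep, int(dep_idx))
```

Calling `group_by_pattern(output)` on the parser output gives one tuple per sentence, and `parse_dependency` can then be mapped over each dependency list if the individual fields are needed. Note that a sentence which happens to match the dependency pattern would be misclassified, so this only suits input shaped like the sample.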