Creating a .conll file as output of the Stanford Parser


I want to use the Stanford Parser to create a .conll file for further processing. So far I have managed to parse the test sentences with this command:

stanford-parser-full-2013-06-20/lexparser.sh  stanford-parser-full-2013-06-20/data/testsent.txt > output.txt

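I would like to get a .conll file instead of a txt file. I am fairly sure this is possible, because the documentation mentions it. Can I modify my command, or do I have to write Java code?

Thanks for your help.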

I am not sure whether you can do this from the command line, but here is a Java version:

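// imports: java.util.List, edu.stanford.nlp.ling.HasWord,
// edu.stanford.nlp.process.DocumentPreprocessor, edu.stanford.nlp.trees.*
// Assumes lp is a loaded LexicalizedParser and gsf a GrammaticalStructureFactory,
// e.g. new PennTreebankLanguagePack().grammaticalStructureFactory().
// DocumentPreprocessor(String) reads the file at that path; wrapping the path in a
// StringReader would tokenize the path string itself.
for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
        Tree parse = lp.apply(sentence);
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        // the fourth argument (true) selects CoNLL-X output
        GrammaticalStructure.printDependencies(gs, gs.typedDependencies(), parse, true, false);
}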

If you are looking for dependencies printed in CoNLL-X (CoNLL 2006) format, try the following from the command line:

java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz stanford-parser-full-2013-06-20/data/testsent.txt >testsent.tree

java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile testsent.tree -conllx

Here is the output for the first test sentence:

1       Scores        _       NNS     NNS     _       4       nsubj        _       _
2       of            _       IN      IN      _       0       erased       _       _
3       properties    _       NNS     NNS     _       1       prep_of      _       _
4       are           _       VBP     VBP     _       0       root         _       _
5       under         _       IN      IN      _       0       erased       _       _
6       extreme       _       JJ      JJ      _       8       amod         _       _
7       fire          _       NN      NN      _       8       nn           _       _
8       threat        _       NN      NN      _       4       prep_under   _       _
9       as            _       IN      IN      _      13       mark         _       _
10      a             _       DT      DT      _      12       det          _       _
11      huge          _       JJ      JJ      _      12       amod         _       _
12      blaze         _       NN      NN      _      15       xsubj        _       _
13      continues     _       VBZ     VBZ     _       4       advcl        _       _
14      to            _       TO      TO      _      15       aux          _       _
15      advance       _       VB      VB      _      13       xcomp        _       _
16      through       _       IN      IN      _       0       erased       _       _
17      Sydney        _       NNP     NNP     _      20       poss         _       _
18      's            _       POS     POS     _       0       erased       _       _
19      north-western _       JJ      JJ      _      20       amod         _       _
20      suburbs       _       NNS     NNS     _      15       prep_through _       _
21      .             _       .       .       _       4       punct        _       _
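Since the goal is further processing, rows like the ones above can be read back with plain Java. The following is only a minimal sketch, not part of the Stanford Parser: the class names ConllxToken and ConllxDemo are made up for illustration, and it assumes the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with columns separated by tabs or runs of spaces and no whitespace inside a token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** One token row of CoNLL-X output (hypothetical helper, not a Stanford class). */
final class ConllxToken {
    final int id;        // column 1: 1-based token index
    final String form;   // column 2: word form
    final String pos;    // column 5: fine-grained POS tag
    final int head;      // column 7: index of the head token, 0 if none
    final String deprel; // column 8: dependency relation label

    ConllxToken(int id, String form, String pos, int head, String deprel) {
        this.id = id;
        this.form = form;
        this.pos = pos;
        this.head = head;
        this.deprel = deprel;
    }

    /** Parses one row; assumes exactly ten whitespace-separated columns. */
    static ConllxToken parse(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != 10) {
            throw new IllegalArgumentException(
                    "expected 10 columns, got " + cols.length + ": " + line);
        }
        return new ConllxToken(Integer.parseInt(cols[0]), cols[1], cols[4],
                Integer.parseInt(cols[6]), cols[7]);
    }
}

final class ConllxDemo {
    /** Parses all non-blank rows and drops those whose relation is "erased". */
    static List<ConllxToken> withoutErased(List<String> lines) {
        List<ConllxToken> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            ConllxToken token = ConllxToken.parse(line);
            if (!"erased".equals(token.deprel)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The first three rows of the output shown above.
        List<String> rows = Arrays.asList(
                "1\tScores\t_\tNNS\tNNS\t_\t4\tnsubj\t_\t_",
                "2\tof\t_\tIN\tIN\t_\t0\terased\t_\t_",
                "3\tproperties\t_\tNNS\tNNS\t_\t1\tprep_of\t_\t_");
        for (ConllxToken token : withoutErased(rows)) {
            System.out.println(token.id + " " + token.form + " "
                    + token.deprel + " head=" + token.head);
        }
    }
}
```

Dropping the "erased" rows is a design choice for this sketch: in the collapsed output above they are the prepositions and the possessive marker that were folded into relations such as prep_of, so they carry head 0 without being the root. Keep them if the downstream tool expects one row per token.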

There is a conll2007 output format.

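Here is an example using version 3.8 of the Stanford Parser. It assumes an input file with one sentence per line, outputs Stanford Dependencies (not Universal Dependencies) without propagation or collapsing, keeps punctuation, and prints in conll2007 format: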

java -Xmx4g -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -outputFormat conll2007 -originalDependencies -outputFormatOptions "basicDependencies,includePunctuationDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz input.txt
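With one sentence per input line, the redirected output will contain several sentences. The sketch below groups the rows back into sentences. It assumes that sentences are separated by blank lines, as in the CoNLL shared-task formats; check this against your actual output. The class name ConllSentenceSplitter and the sample rows in main are made up for illustration and are not real parser output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Groups CoNLL rows into sentences (hypothetical helper, not a Stanford class). */
final class ConllSentenceSplitter {
    /** A blank line closes the current sentence; repeated blank lines are ignored. */
    static List<List<String>> split(List<String> lines) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line);
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // last sentence without a trailing blank line
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines;
        if (args.length > 0) {
            // Path to a file holding the redirected parser output.
            lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        } else {
            // Made-up two-sentence sample in a ten-column layout.
            lines = Arrays.asList(
                    "1\tHello\t_\tUH\tUH\t_\t0\troot\t_\t_",
                    "",
                    "1\tGoodbye\t_\tUH\tUH\t_\t0\troot\t_\t_");
        }
        List<List<String>> sentences = split(lines);
        for (int i = 0; i < sentences.size(); i++) {
            System.out.println("sentence " + (i + 1) + ": "
                    + sentences.get(i).size() + " tokens");
        }
    }
}
```

To use it on real output, append a redirect such as > output.conll to the command above and pass that file name as the first argument.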