Stanford NLP: Extracting multi-word named entities with CoreNLP


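I'm using CoreNLP for named entity extraction and have run into an issue. Whenever a named entity is made up of more than one token, such as "Han Solo", the annotator does not return "Han Solo" as a single named entity. It returns two separate entities, "Han" and "Solo".

Is it possible to get the named entity as a single unit? I know I can use CRFClassifier with classifyWithInlineXML for this, but my solution requires CoreNLP because I also need to know the word number.

Here is the code I have so far: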

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
    props.setProperty("ner.model", "edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation document = new Annotation(text);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            // Prints one NER tag per token, so "Han Solo" yields two separate lines
            System.out.println(token.get(NamedEntityTagAnnotation.class));
        }
    }


Help me, Obi-Wan Kenobi. You're my only hope.

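    PrintWriter writer = null;
    try {
        // The two halves are joined without a space, which is consistent with
        // the token "Yorkand" in the output.
        String inputLine = "Several possible plans emerged from the talks, held at the Federal Reserve Bank of New York"
                + "and led by Timothy R. Geithner, the president of the New York Fed, and Treasury Secretary Henry M. Paulson Jr.";
        String serializedClassifier = "english.all.3class.distsim.crf.ser.gz";
        AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        writer = new PrintWriter(new File("output.xml"));
        writer.println("");
        writer.flush();
        // "xml" output format: one <wi num=".." entity="..">word</wi> element per token
        String output = "" + classifier.classifyToString(inputLine, "xml", true) + "";
        writer.println(output);
        writer.flush();
        writer.println("");
        writer.flush();
    } catch (FileNotFoundException ex) {
        ex.printStackTrace();
    } finally {
        // Guard against a NullPointerException if the file could not be opened
        if (writer != null) {
            writer.close();
        }
    }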

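I was able to come up with this workaround. I write the output to an XML file, "output.xml". In that output, consecutive nodes whose entity attribute is "PERSON", "ORGANIZATION" or "LOCATION" can be merged into a single entity. This format produces the word number by default.

Here is a snapshot of the XML output: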

<wi num="11" entity="ORGANIZATION">Federal</wi>
<wi num="12" entity="ORGANIZATION">Reserve</wi>
<wi num="13" entity="ORGANIZATION">Bank</wi>
<wi num="14" entity="ORGANIZATION">of</wi>
<wi num="15" entity="ORGANIZATION">New</wi>
<wi num="16" entity="ORGANIZATION">Yorkand</wi>


As the output above shows, consecutive words are tagged as "ORGANIZATION", so these words can be combined into a single entity.
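The merging step can be sketched roughly as below. This is only an illustration and not part of CoreNLP: the class `WiEntityGrouper`, its `group` method and the result format are made-up names, and it assumes the elements look exactly like the snapshot and that tokens outside any entity carry the label `O`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a CoreNLP class): groups consecutive <wi> elements
// that share the same entity label into one entry, keeping the word numbers.
class WiEntityGrouper {

    // Matches elements shaped like: <wi num="11" entity="ORGANIZATION">Federal</wi>
    private static final Pattern WI =
            Pattern.compile("<wi num=\"(\\d+)\" entity=\"([^\"]+)\">([^<]*)</wi>");

    /** Returns entries of the form "LABEL[firstNum-lastNum]: joined words". */
    static List<String> group(String xml) {
        List<String> result = new ArrayList<>();
        Matcher m = WI.matcher(xml);
        String label = null;          // label of the entity being built, null if none
        int first = -1;
        int last = -1;
        StringBuilder text = new StringBuilder();
        while (m.find()) {
            int num = Integer.parseInt(m.group(1));
            String entity = m.group(2);
            String word = m.group(3);
            // Same label and adjacent word number: the current entity continues
            boolean continues = label != null && entity.equals(label) && num == last + 1;
            if (continues) {
                text.append(' ');
            } else {
                if (label != null) {
                    result.add(format(label, first, last, text));
                }
                text.setLength(0);
                // Assumption: "O" marks tokens that belong to no entity
                label = entity.equals("O") ? null : entity;
                first = num;
            }
            if (label != null) {
                text.append(word);
            }
            last = num;
        }
        if (label != null) {
            result.add(format(label, first, last, text));
        }
        return result;
    }

    private static String format(String label, int first, int last, StringBuilder text) {
        return label + "[" + first + "-" + last + "]: " + text;
    }

    public static void main(String[] args) {
        String xml = "<wi num=\"15\" entity=\"ORGANIZATION\">New</wi>\n"
                + "<wi num=\"16\" entity=\"ORGANIZATION\">Yorkand</wi>";
        System.out.println(group(xml));
    }
}
```

The word numbers of the first and last token are kept with each merged entity, since the question needs them. The regular expression does not unescape XML entities, so a real XML parser would be the safer choice for arbitrary text.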

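I use a temp variable to hold the previous NER tag and check whether the current NER tag equals temp. If it does, the two words are combined. The iteration continues by assigning the current NER tag to temp.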
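A minimal sketch of that idea, written against plain arrays so it runs without CoreNLP. The class `NerTagMerger` and its `merge` method are hypothetical names, and the tags in `main` are illustrative rather than actual classifier output. In a real pipeline the words and tags would come from each token's annotations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a CoreNLP class): merges runs of equal NER tags.
class NerTagMerger {

    /** Joins consecutive tokens that share the same non-"O" tag into "TAG: words". */
    static List<String> merge(String[] words, String[] tags) {
        List<String> entities = new ArrayList<>();
        String prevTag = "O";                       // the "temp" variable: previous token's tag
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            String tag = tags[i];
            if (!tag.equals("O") && tag.equals(prevTag)) {
                // Same tag as the previous token: extend the current entity
                current.append(' ').append(words[i]);
            } else {
                // Tag changed: close the entity built so far, if any
                if (current.length() > 0) {
                    entities.add(prevTag + ": " + current);
                    current.setLength(0);
                }
                if (!tag.equals("O")) {
                    current.append(words[i]);       // start a new entity
                }
            }
            prevTag = tag;                          // temp takes the current tag
        }
        if (current.length() > 0) {
            entities.add(prevTag + ": " + current); // flush the last entity
        }
        return entities;
    }

    public static void main(String[] args) {
        // Illustrative tags, assumed for this example
        String[] words = {"Help", "me", "Obi-Wan", "Kenobi", "."};
        String[] tags = {"O", "O", "PERSON", "PERSON", "O"};
        System.out.println(merge(words, tags));
    }
}
```

One limitation of comparing only adjacent tags: two distinct entities of the same type that sit next to each other with nothing in between are merged into one.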

Bah. Silly me. Great solution!
