Java: Getting the original text after using the Stanford NLP parser


java stanford-nlp


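Hello everyone,

We have the following problem with the Stanford NLP API: we have a String that we want to split into a list of sentences. First we used String sentenceString = Sentence.listToString(sentence), but because of tokenization, listToString does not return the original text. Now we are trying to use listToOriginalTextString in the following way: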

private static List<String> getSentences(String text) {
        Reader reader = new StringReader(text);
        DocumentPreprocessor dp = new DocumentPreprocessor(reader);
        List<String> sentenceList = new ArrayList<String>();

        for (List<HasWord> sentence : dp) {
            String sentenceString = Sentence.listToOriginalTextString(sentence);
            sentenceList.add(sentenceString.toString());
        }

        return sentenceList;
    }


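This does not work. Apparently we have to set the attribute "invertible" to true, but we do not know how. How can we do that?

More generally, how do you use listToOriginalTextString correctly? What preparation does it need?

Sincerely,

Khayet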

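If I understand correctly, you want a mapping from the tokens back to the original input text after tokenization. You can do it like this: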

        // Tokenize with PTBTokenizer (PTBLexer)
        List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory().getTokenizer(new StringReader(text)).tokenize();

        // Split the tokens into sentences with the Stanford sentence splitter (WordToSentenceProcessor)
        WordToSentenceProcessor<CoreLabel> processor = new WordToSentenceProcessor<CoreLabel>();
        List<List<CoreLabel>> splitSentences = processor.process(tokens);

        // For each sentence
        for (List<CoreLabel> s : splitSentences) {

            // For each token
            for (CoreLabel token : s) {
                // Here you can read the token value and its position in the input:
                // token.value(), token.beginPosition(), token.endPosition()
            }

        }
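
To illustrate why the token positions matter, here is a minimal sketch that uses only the Java standard library. The class Token is a hypothetical stand-in for CoreLabel, with its begin and end fields playing the role of token.beginPosition() and token.endPosition(). The names OffsetSketch, joinValues and originalText are made up for this example and are not part of the Stanford NLP API. Joining token values with spaces changes the spacing, while taking the substring between the first token's begin offset and the last token's end offset returns the input exactly as written.

```java
import java.util.List;

class OffsetSketch {

    /** Hypothetical stand-in for a token that remembers its position in the input. */
    public static final class Token {
        public final String value;
        public final int begin; // offset of the token's first character in the original text
        public final int end;   // offset just past the token's last character

        public Token(String value, int begin, int end) {
            this.value = value;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Joins token values with single spaces; the original spacing is lost. */
    public static String joinValues(List<Token> sentence) {
        StringBuilder sb = new StringBuilder();
        for (Token t : sentence) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.value);
        }
        return sb.toString();
    }

    /** Returns the exact span of the original text covered by the sentence. */
    public static String originalText(String text, List<Token> sentence) {
        if (sentence.isEmpty()) {
            return "";
        }
        int begin = sentence.get(0).begin;
        int end = sentence.get(sentence.size() - 1).end;
        return text.substring(begin, end);
    }
}
```

For the input "Hello,  world!" (with two spaces) and the tokens Hello (0-5), the comma (5-6), world (8-13) and the exclamation mark (13-14), joinValues gives "Hello , world !", whereas originalText gives back "Hello,  world!" unchanged.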


CoreAnnotations.TextAnnotation provides the original text. An example from the JSONOutputter.java file:

l2.set("id", sentence.get(CoreAnnotations.SentenceIDAnnotation.class));
l2.set("index", sentence.get(CoreAnnotations.SentenceIndexAnnotation.class));
l2.set("sentenceOriginal",sentence.get(CoreAnnotations.TextAnnotation.class));
l2.set("line", sentence.get(CoreAnnotations.LineNumberAnnotation.class));
