Java: How to extract noun phrases from parsed text

Tags: java, stanford-nlp

I have parsed a text with a constituency parser and copied the result into a text file, as shown below:

(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went) (PP (TO to)....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (PRP I)) (VP (VBD was) (NP (NP (EX...
(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP I)) (VP (VBD went) (PP (TO to.....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (NNP Jim)) (VP (VBD was) (NP (NP (....
(ROOT (S (S (NP (PRP I)) (VP (VBD started) (S (VP (VBG talking) (PP.....
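I need to extract all noun phrases (NP) from this text file. I wrote the following code, which extracts only the first NP from each line. However, I need to extract all of the noun phrases. My code is: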

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;

public class nounPhrase {

    // Returns the index of the ')' that closes the '(' at openPos
    public static int findClosingParen(char[] text, int openPos) {
        int closePos = openPos;
        int counter = 1;
        while (counter > 0) {
            char c = text[++closePos];
            if (c == '(') {
                counter++;
            }
            else if (c == ')') {
                counter--;
            }
        }
        return closePos;
    }

    public static void main(String[] args) throws IOException {

        ArrayList<String> npList = new ArrayList<String>();
        String line;
        String line1;
        int np;

        String Input = "/local/Input/Temp/Temp.txt";

        String Output = "/local/Output/Temp/Temp-out.txt";

        FileInputStream fis = new FileInputStream(Input);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
        while ((line = br.readLine()) != null) {
            char[] lineArray = line.toCharArray();
            // Only the first "(NP" on the line is located here
            np = findClosingParen(lineArray, line.indexOf("(NP"));
            line1 = line.substring(line.indexOf("(NP"), np + 1);
            System.out.print(line1 + "\n");
        }
    }
}
The output is:

(NP (NN Yesterday))...I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also
(NP (NNP Jim)).....I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also
My code takes only the first NP on each line, up to its closing parenthesis, but I need to extract all NPs from the text.

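You are building a parser (for the output produced by your natural-language parser), and parsing is a topic with extensive academic documentation. The simplest parser you can build is an LL parser. Take a look at the Wikipedia article on it, which has some good examples to get you started.

The Wikipedia entry on parsing in general may also give you an overview of the field.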
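
To make the LL-parser suggestion concrete, here is a minimal recursive-descent sketch for bracketed parse trees. It is not code from this thread: the class name `NpTreeParser`, its method names, and the sample sentence are all made up for illustration, and it assumes each tree is complete and balanced (it throws on truncated input). The grammar it follows is `node := '(' label (node | word)* ')'`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class NpTreeParser {
    private final String text;
    private int pos = 0;
    private final List<String> nounPhrases = new ArrayList<>();

    private NpTreeParser(String text) {
        this.text = text;
    }

    /** Returns every NP subtree in the bracketed text, outermost first. */
    static List<String> extractNounPhrases(String parse) {
        NpTreeParser p = new NpTreeParser(parse);
        p.skipSpaces();
        while (p.pos < p.text.length()) {
            p.parseNode();
            p.skipSpaces();
        }
        return p.nounPhrases;
    }

    // node := '(' label (node | word)* ')'
    private void parseNode() {
        int start = pos;
        expect('(');
        String label = readToken();
        int slot = -1;
        if (label.equals("NP")) {
            slot = nounPhrases.size();
            nounPhrases.add(null);      // placeholder keeps outermost-first order
        }
        skipSpaces();
        while (pos < text.length() && text.charAt(pos) != ')') {
            if (text.charAt(pos) == '(') {
                parseNode();            // nested constituent
            } else {
                readToken();            // leaf word, not needed here
            }
            skipSpaces();
        }
        expect(')');
        if (slot >= 0) {
            nounPhrases.set(slot, text.substring(start, pos));
        }
    }

    // Reads characters up to the next parenthesis or whitespace.
    private String readToken() {
        int start = pos;
        while (pos < text.length()) {
            char c = text.charAt(pos);
            if (c == '(' || c == ')' || Character.isWhitespace(c)) {
                break;
            }
            pos++;
        }
        return text.substring(start, pos);
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    private void expect(char c) {
        if (pos >= text.length() || text.charAt(pos) != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        // Made-up sample tree with one NP nested inside another
        String line = "(ROOT (S (NP (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat)))) (VP (VBD slept))))";
        for (String np : extractNounPhrases(line)) {
            System.out.println(np);
        }
    }
}
```

Because the parser descends into every node, it also reports NPs nested inside other NPs; for the sample line it prints the outer NP first, followed by the two inner ones.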

Here you go. I changed it a bit and it ended up messy, but if you really need the code, I can clean it up.

import java.io.*;
import java.util.*;

public class nounPhrase {
    public static void main(String[] args) throws IOException {

        ArrayList<String> npList = new ArrayList<String>();
        String line;

        String Input = "/local/Input/Temp/Temp.txt";
        String Output = "/local/Output/Temp/Temp-out.txt"; // declared but not used below

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(Input), "UTF-8"))) {
            while ((line = br.readLine()) != null) {
                char[] lineArray = line.toCharArray();
                for (int i = 0; i + 2 < lineArray.length; i++) {
                    if (lineArray[i] == '(' && lineArray[i + 1] == 'N' && lineArray[i + 2] == 'P') {
                        // Count parentheses until the one that closes this NP
                        int depth = 0;
                        int end = i;
                        while (end < lineArray.length) {
                            if (lineArray[end] == '(') depth++;
                            else if (lineArray[end] == ')' && --depth == 0) break;
                            end++;
                        }
                        if (end < lineArray.length) { // skip an NP cut off by a truncated line
                            npList.add(line.substring(i, end + 1));
                        }
                        // i is not moved past the NP, so NPs nested inside it are found too
                    }
                }
                npList.add("*"); // marks the end of one input line
            }
        }

        // Print all NPs of one input line on one output line
        for (int i = 0; i < npList.size(); i++) {
            if (!npList.get(i).equals("*")) {
                System.out.print(npList.get(i));
                if (i < npList.size() - 1 && npList.get(i + 1).equals("*")) {
                    System.out.println();
                }
            }
        }
    }
}

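After you get the first NP, you have to keep iterating over the parse tree and move the index of the noun phrase forward. The simple way is to take a substring of the line variable whose start index is np + 1. Here are the changes you can make to your code: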

while ((line = br.readLine()) != null) {
    char[] lineArray = line.toCharArray();
    int indexOfNP = line.indexOf("(NP");
    // Keep going until no "(NP" is left in the remainder of the line
    while (indexOfNP != -1) {
        np = findClosingParen(lineArray, indexOfNP);
        line1 = line.substring(indexOfNP, np + 1);
        System.out.print(line1 + "\n");
        npList.add(line1);
        // Drop everything up to and including this NP's closing parenthesis
        line = line.substring(np + 1);
        indexOfNP = line.indexOf("(NP");
        lineArray = line.toCharArray();
    }
}
For a recursive solution:

public static void main(String[] args) throws IOException {

    ArrayList<String> npList = new ArrayList<String>();
    String line;
    String Input = "/local/Input/Temp/Temp.txt";
    String Output = "/local/Output/Temp/Temp-out.txt";

    FileInputStream fis = new FileInputStream (Input);
    BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"));
    while ((line = br.readLine())!= null){
        int indexOfNP = line.indexOf("(NP");
        if(indexOfNP>=0)
            extractNPs(npList,line,indexOfNP);
    }

    for(String npString:npList){
        System.out.println(npString);
    }

    br.close();
    fis.close();

}

public static ArrayList<String> extractNPs(ArrayList<String> arr,String  
                                                   parse, int indexOfNP){
    if(indexOfNP==-1){
        return arr;
    }
    else{
        int npIndex = findClosingParen(parse.toCharArray(), indexOfNP);
        String mainNP = parse.substring(indexOfNP, npIndex + 1);
        arr.add(mainNP);
        // Uncomment the lines below if you also want the NPs nested
        // inside mainNP to be extracted, in addition to mainNP itself
        /*
        String inner = mainNP.substring(3);
        if (inner.indexOf("(NP") >= 0) {
            extractNPs(arr, inner, inner.indexOf("(NP"));
        }
        */
        // Continue with whatever follows this NP on the line
        parse = parse.substring(npIndex + 1);
        indexOfNP = parse.indexOf("(NP");
        return extractNPs(arr, parse, indexOfNP);
    }
}
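
Both versions above call the question's `findClosingParen`, which runs past the end of the array when a line is truncated or has unbalanced parentheses. The following is a minimal sketch of a bounds-checked variant, not the asker's code: the class name `NpScanner`, the `String`-based signature, and the sample sentence are assumptions made for illustration. It searches with `indexOf(String, int)` instead of re-slicing the line, and, like the loop above, it skips NPs nested inside an NP it has already returned.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class NpScanner {

    /** Index of the ')' matching the '(' at openPos, or -1 if the text ends first. */
    static int findClosingParen(String text, int openPos) {
        int depth = 0;
        for (int i = openPos; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    /** NPs that are not nested inside an earlier NP, in left-to-right order. */
    static List<String> extractNps(String line) {
        List<String> result = new ArrayList<>();
        int from = line.indexOf("(NP");
        while (from != -1) {
            int close = findClosingParen(line, from);
            if (close == -1) {
                break; // unbalanced or truncated line: stop instead of throwing
            }
            result.add(line.substring(from, close + 1));
            // Resume the search right after this NP's closing parenthesis
            from = line.indexOf("(NP", close + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample tree
        String line = "(ROOT (S (NP (PRP I)) (VP (VBD saw) (NP (DT a) (NN dog)))))";
        for (String np : extractNps(line)) {
            System.out.println(np);
        }
    }
}
```

Returning -1 rather than throwing lets the caller decide what to do with a damaged line; here the scan simply stops and keeps the NPs found so far.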

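While writing your own tree parser is a good exercise (!), if you just want the results, the simplest way is to use more of what the Stanford NLP tools offer, namely Tregex, which is designed for exactly this kind of task. You can change your final while loop to the following: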

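// Requires these imports from the Stanford NLP tools:
//   import edu.stanford.nlp.trees.Tree;
//   import edu.stanford.nlp.trees.tregex.TregexMatcher;
//   import edu.stanford.nlp.trees.tregex.TregexPattern;
TregexPattern tPattern = TregexPattern.compile("NP");
while ((line = br.readLine()) != null) {
    Tree t = Tree.valueOf(line);
    TregexMatcher tMatcher = tPattern.matcher(t);
    // find() steps through every node labelled NP, including nested ones
    while (tMatcher.find()) {
        System.out.println(tMatcher.getMatch());
    }
}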

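When pasting in code (without tabs), select the whole block and press Ctrl+K to indent it all by 4 spaces so that it is marked up as a code block (that way the final brace is included too). "Thanks" should never be part of a good question. Leaving it out is not impolite, and putting it in wastes readers' time.

My problem is not the parser; it is more about processing the text. I have already parsed the text, and now I need to extract all NP patterns.

What I am saying is that what you are actually doing is parsing the output of the NP parser.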