Java: Extracting nouns and the original sentence from POS tags (Java, Regex, NLP, OpenNLP)


I want to extract the nouns from a sentence and recover the original sentence from its POS tags.

 // Extract the words before _NNP and _NN below, and also get back the original sentence from the POS tags.
 Original sentence: Hi. How are you? This is Mike.
 POS tags: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN
I tried something like this:

    String txt = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";


    String re1 = "((?:[a-z][a-z0-9_]*))";   // Variable Name 1
    String re2 = ".*?"; // Non-greedy match on filler
    String re3 = "(_)"; // Any Single Character 1
    String re4 = "(NNP)";   // Word 1

    Pattern p = Pattern.compile(re1 + re2 + re3 + re4, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find()) {
        String var1 = m.group(1);
        System.out.print(var1);
    }
Output: Hi

But I need a list of all the nouns in the sentence.

Use while (m.find()) instead of if (m.find()) to iterate over all matches.

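Also, your regex can be simplified considerably:

  • If you do not need to capture something, just leave out the parentheses (usually).
  • You are using ((?:…)), which is odd: a non-capturing group nested directly inside a capturing group serves no purpose.
  • I am not sure the .*? part does what you expect. If you want to match a dot, use [.]

So try something like

    ([a-z][a-z0-9_]*)[.]_NNP

or even a positive lookahead:

    [a-z][a-z0-9_]*(?=[.]_NNP)

Use m.group() to access the captured data.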
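As a minimal sketch of the two suggestions combined (looping with while and using a lookahead), the following collects every match rather than only the first. The class and method names are hypothetical, and the lookahead is widened to [.]?_NN on the assumption that any tag beginning with _NN should count as a noun, with or without a dot attached to the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NounMatchLoop {

    // Assumption: "[.]?_NN" in the lookahead accepts any tag that begins with
    // _NN, whether or not a dot is attached to the word.
    private static final Pattern NOUN =
            Pattern.compile("[a-z][a-z0-9]*(?=[.]?_NN)", Pattern.CASE_INSENSITIVE);

    static List<String> findNouns(String taggedSentence) {
        List<String> nouns = new ArrayList<>();
        Matcher m = NOUN.matcher(taggedSentence);
        // while, not if: every call to find() advances to the next match
        while (m.find()) {
            // The lookahead consumes nothing, so group() is the bare word
            nouns.add(m.group());
        }
        return nouns;
    }

    public static void main(String[] args) {
        String txt = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
        System.out.println(findNouns(txt)); // [Hi, Mike]
    }
}
```

Because the tag sits in a lookahead, no capturing group is needed and m.group() already returns the word without its tag.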

To extract the nouns, you can do the following:

public static String[] extractNouns(String sentenceWithTags) {
    // Split String into array of Strings whenever there is a tag that starts with "_NN"
    // followed by zero, one or two more letters (like "_NNP", "_NNPS", or "_NNS")
    String[] nouns = sentenceWithTags.split("_NN\\w?\\w?\\b");
    // remove all but last word (which is the noun) in every String in the array
    // Caveat: if the input does not end with a noun tag, the trailing chunk is kept as well
    for(int index = 0; index < nouns.length; index++) {
        nouns[index] = nouns[index].substring(nouns[index].lastIndexOf(" ") + 1)
        // Remove all non-word characters from extracted Nouns
        .replaceAll("[^\\p{L}\\p{Nd}]", "");
    }
    return nouns;
}
public static String extractOriginal(String sentenceWithTags) {
    return sentenceWithTags.replaceAll("_([A-Z]*)\\b", "");
}
Proof that it works:

public static void main(String[] args) {
    String sentence = "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN";
    System.out.println(java.util.Arrays.toString(extractNouns(sentence)));
    System.out.println(extractOriginal(sentence));
}
Output:

    [Hi, Mike]
    Hi. How are you? This is Mike.

Note: to remove all non-word characters (such as punctuation) from the extracted nouns, I used the regex [^\p{L}\p{Nd}].
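To make the effect of that character class concrete, here is a small sketch (the class and method names are hypothetical). \p{L} matches any Unicode letter and \p{Nd} any decimal digit, so the negated class deletes everything else.

```java
class PunctuationStripper {

    // \p{L} is any Unicode letter, \p{Nd} any decimal digit; the leading ^
    // negates the class, so every other character is deleted.
    static String stripNonWordChars(String token) {
        return token.replaceAll("[^\\p{L}\\p{Nd}]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonWordChars("Mike."));  // Mike
        System.out.println(stripNonWordChars("you?"));   // you
        System.out.println(stripNonWordChars("R2-D2"));  // R2D2
    }
}
```

Unlike [^a-zA-Z0-9], this keeps accented and non-Latin letters. The last call shows the trade-off: hyphens and apostrophes inside a word are removed too.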

This should work:

import java.util.ArrayList;
import java.util.List;

public class Test {

    // A run of letters followed by a noun tag: _NN, _NNS, _NNP or _NNPS
    public static final String NOUN_REGEX = "[a-zA-Z]*_NN\\w?\\w?\\b";

    public static List<String> extractNounsByRegex(String sentenceWithTags) {
        List<String> nouns = new ArrayList<>();
        // Tokens are separated by whitespace, so test each token on its own
        for (String word : sentenceWithTags.split("\\s+")) {
            if (word.matches(NOUN_REGEX)) {
                // Remove the _NN* suffix and keep the [a-zA-Z]* part
                nouns.add(word.replaceAll("_NN\\w?\\w?\\b", ""));
            }
        }
        return nouns;
    }

    public static String extractOriginal(String sentenceWithTags) {
        // Strip every tag, not only the noun tags, to get the plain sentence back
        return sentenceWithTags.replaceAll("_[A-Z]+\\b", "");
    }

    public static void main(String[] args) {
        // Note: tokens with attached punctuation, such as "Hi._NNP", do not match NOUN_REGEX
        String sentence = "Eiffel_NNP tower_NN is_VBZ in_IN paris_NN Hi_NNP How_WRB are_VBP you_PRP This_DT is_VBZ Mike_NNP Barrack_NNP Obama_NNP is_VBZ a_DT president_NN this_VBZ";
        System.out.println(extractNounsByRegex(sentence));
        System.out.println(extractOriginal(sentence));
    }
}
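For the second half of the question, recovering the original sentence, a pattern built on \b has one gap worth knowing about: Penn Treebank tags such as PRP$ and WP$ end in a dollar sign, and a \b-based pattern stops before the $ and leaves it in the text. The sketch below is one way around that, assuming tags are always the last part of a whitespace-separated token. The class and method names are hypothetical.

```java
class TagStripper {

    // "_" + capital letters + optional "$", but only at the end of a token,
    // that is, followed by whitespace or the end of the input.
    static String stripTags(String taggedSentence) {
        return taggedSentence.replaceAll("_[A-Z]+\\$?(?=\\s|$)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("Hi._NNP How_WRB are_VBP you?_JJ")); // Hi. How are you?
        System.out.println(stripTags("my_PRP$ dog_NN barks_VBZ"));        // my dog barks
    }
}
```

The lookahead replaces \b because there is no word boundary between "$" and a following space. Punctuation that the tagger emits as its own token (for example a period tagged as ".") is not handled by this sketch.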
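Have you tried anything?

[a-zA-Z]+(?=[.]_NN) will capture any string of alphabetic characters followed by ._NN; maybe you can start from that.

Thanks for the reply.

There is a typo in your example. In the first block, "Mike." is followed by "_NN", but in the second block it is followed by "_NNP".

That is a type of noun, a proper noun. I get the POS tags from the tagger, and it gives me different types of nouns.

Thanks for your answer, James. It works fine for NN nouns, but I need all of these: NN (noun, singular or mass), NNP (proper noun, singular), NNPS (proper noun, plural) and NNS (noun, plural). It is not a typo in the sentence.

I fixed the typo, but how do I split on the different noun tags? At the moment it only splits on nouns ending in _NN, and I also need to extract the nouns tagged _NNP, _NNPS and _NNS.

@srp Oh, OK. In that case, just add "\\w?\\w?" to the regex "_NN\\b", right after "_NN". "\\w" matches one word character and "?" means zero or one occurrence, so it will find "_NN" followed by zero, one or two word characters. Answer updated.

@srp EDIT: I modified the regex in the extractOriginal() method; it is more robust now.

@James, your comment says "// remove all but last word (which is the noun) in every String in the array". That always adds the last word to the nouns array even when it is not a noun, and the last word is not necessarily a noun every time. How can I eliminate this? Could you tell me?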
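One possible way to address the last comment is to stop splitting the sentence and instead match only the tokens that actually carry a noun tag, so a trailing non-noun can never end up in the result. This is a sketch under assumptions rather than the answerer's own fix: the class and method names are hypothetical, and tokens are assumed to be separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaggedNounExtractor {

    // Group 1 is the token text; the tag must be exactly NN, NNS, NNP or NNPS
    // and must end the token (whitespace or end of input follows).
    private static final Pattern NOUN_TOKEN =
            Pattern.compile("(\\S+?)_NN(?:PS?|S)?(?=\\s|$)");

    static List<String> extractNouns(String taggedSentence) {
        List<String> nouns = new ArrayList<>();
        Matcher m = NOUN_TOKEN.matcher(taggedSentence);
        while (m.find()) {
            // Drop punctuation that is still attached to the word
            nouns.add(m.group(1).replaceAll("[^\\p{L}\\p{Nd}]", ""));
        }
        return nouns;
    }

    public static void main(String[] args) {
        // The sentence ends with an adverb, which is not returned
        System.out.println(extractNouns("Mike._NN is_VBZ here_RB")); // [Mike]
        System.out.println(extractNouns(
                "Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NN")); // [Hi, Mike]
    }
}
```

Using \S+? for the word, rather than [a-zA-Z]*, lets tokens with attached punctuation such as "Hi._NNP" match; the punctuation is removed afterwards.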