Finding collocation patterns in Java
I am working on a project where I need to use collocations. I wrote the following code to extract them: it takes a string and returns the list of collocation patterns found in that string. I have already done the tagging with the Stanford POS tagger.

I would like your advice on this code. It seems very slow, and I am processing large amounts of text. Any suggestions for improving it would be much appreciated.
/**
*
* A COLLOCATION is an expression consisting of two or more words that
* correspond to some conventional way of saying things.
*
* I used the seven part-of-speech tag patterns for collocation filtering
* suggested by Justeson and Katz (1995).
* These patterns are:
*
* -----------------------------------------
* |Tag | Pattern Example |
* -----------------------------------------
* |AN | linear function |
* |NN | regression coefficients |
* |AAN | Gaussian random variable |
* |ANN | cumulative distribution function |
* |NAN | mean squared error |
* |NNN | class probability function |
* |NPN | degrees of freedom |
* -----------------------------------------
* Where A=adjective, P=preposition, & N=noun.
*
* The Stanford POS tagger has been used for the extraction process.
* see: http://nlp.stanford.edu/software/tagger.shtml#Download
*
* more on collocation: http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
* more on POS: http://acl.ldc.upenn.edu/J/J93/J93-2004.pdf
*
*/
public class GetCollocations {
    public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException {
        // NOTE: the tagger model is reloaded from disk on every call; if this
        // method is called repeatedly, caching the MaxentTagger in a static
        // field removes most of the per-call cost.
        MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
        String[] tagged = tagger.tagString(text).split("\\s+");
        ArrayList<String> collocations = new ArrayList<>();
        // Loop to length - 1 so tagged[i + 1] is always in bounds; trigram
        // lookups additionally check that tagged[i + 2] exists (indexing
        // i + 1 and i + 2 unconditionally throws
        // ArrayIndexOutOfBoundsException near the end of the input).
        for (int i = 0; i + 1 < tagged.length; i++) {
            String pot = tagged[i].substring(tagged[i].indexOf("_") + 1);
            if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));
                    if (i + 2 < tagged.length) {
                        pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                        if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                            collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                        }
                    }
                } else if ((pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) && i + 2 < tagged.length) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                } else if (pot.equals("IN") && i + 2 < tagged.length) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }
            } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));
                    if (i + 2 < tagged.length) {
                        pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                        if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                            collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                        }
                    }
                } else if ((pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) && i + 2 < tagged.length) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }
            }
        }
        return collocations;
    }
    public static String GetWordWithoutTag(String wordWithTag) {
        return wordWithTag.substring(0, wordWithTag.indexOf("_"));
    }
}
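An alternative to the nested if chains above is to make the patterns data: collapse each Penn Treebank tag to A, P, or N, build a tag signature for the token sequence, and slide the seven Justeson-Katz patterns over it. This is a self-contained sketch with no Stanford dependency; the class and method names are made up for illustration, and it takes pre-tagged tokens as parallel arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PatternMatcher {
    // The seven Justeson-Katz patterns, with A=adjective, P=preposition, N=noun.
    private static final List<String> PATTERNS =
            List.of("AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN");

    // Collapse a Penn Treebank tag to A, N, P, or O (other).
    static char collapse(String tag) {
        if (tag.startsWith("NN")) return 'N';   // NN, NNS, NNP, NNPS
        if (tag.startsWith("JJ")) return 'A';   // JJ, JJR, JJS
        if (tag.equals("IN")) return 'P';
        return 'O';
    }

    // words[i] is the token whose tag is tags[i]; returns every pattern match.
    static List<String> collocations(String[] words, String[] tags) {
        StringBuilder sig = new StringBuilder();
        for (String t : tags) {
            sig.append(collapse(t));
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            for (String p : PATTERNS) {
                if (i + p.length() <= words.length
                        && sig.substring(i, i + p.length()).equals(p)) {
                    out.add(String.join(" ",
                            Arrays.copyOfRange(words, i, i + p.length())));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collocations(
                new String[]{"linear", "function"},
                new String[]{"JJ", "NN"}));  // prints [linear function]
    }
}
```

Adding or dropping a pattern then only means editing the `PATTERNS` list, not restructuring conditionals; note that, like the original, overlapping matches (a trigram and the bigram inside it) are both emitted.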
If you're processing close to 15,000 words per second, you're already maxing out the POS tagger, going by Stanford's own reported figures.
The rest of the algorithm looks fine, though if you really want to squeeze something out of it, you could preallocate an array as a static class variable instead of using an ArrayList. You basically trade an up-front memory cost for not having to instantiate the ArrayList on every call, and not paying the cost of adding elements.
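The answer's preallocation suggestion can also be approximated without a static array: giving the ArrayList an initial capacity removes the repeated internal-array growth while keeping the simpler API. A minimal sketch, where the `nounsOf` helper is hypothetical and only serves to isolate the allocation pattern:

```java
import java.util.ArrayList;
import java.util.List;

public class PreallocDemo {
    // The token count is a safe upper bound on the number of results, so
    // sizing the list up front means the backing array is never regrown.
    static List<String> nounsOf(String[] tagged) {
        List<String> out = new ArrayList<>(tagged.length);
        for (String t : tagged) {
            if (t.endsWith("_NN")) {
                out.add(t.substring(0, t.indexOf('_')));
            }
        }
        return out;
    }
}
```

For collocation extraction the true count is unknown in advance, which is why a generously sized list (or the static array the answer suggests) is a memory-for-speed trade.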
One more suggestion for readability: consider pulling the part-of-speech checks on the `pot` variable into a couple of private helper methods:
private static boolean _isNoun(String pot) {
    return pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS");
}

private static boolean _isAdjective(String pot) {
    return pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS");
}
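Since all four Penn Treebank noun tags begin with `NN` and all three adjective tags with `JJ`, the equality chains can be collapsed into prefix tests. A small sketch (wrapped in a throwaway class so it stands alone):

```java
public class TagChecks {
    // NN, NNS, NNP, NNPS all share the "NN" prefix.
    static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    // JJ, JJR, JJS all share the "JJ" prefix.
    static boolean isAdjective(String tag) {
        return tag.startsWith("JJ");
    }
}
```

A prefix test would also accept any other tag starting with those prefixes, but in the Penn tagset those sets are exactly the noun and adjective tags listed above.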
Also, unless I'm misreading it, you should be able to simplify what you're doing by combining some of the if statements. This won't really speed the code up, but it makes it easier to work with. Read it over carefully; I've only tried to simplify your logic to make the point.
private static boolean _isNoun(String pot) {
    return pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS");
}

private static boolean _isAdjective(String pot) {
    return pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS");
}

public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException {
    MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
    String[] tagged = tagger.tagString(text).split("\\s+");
    ArrayList<String> collocations = new ArrayList<>();
    // Loop to length - 1 so tagged[i + 1] is always in bounds; trigram
    // lookups additionally check that tagged[i + 2] exists.
    for (int i = 0; i + 1 < tagged.length; i++) {
        String pot = tagged[i].substring(tagged[i].indexOf("_") + 1);
        if (_isNoun(pot) || _isAdjective(pot)) {
            pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
            // Note: accepting noun-or-adjective here also admits
            // adjective-adjective and noun-adjective bigrams, which is
            // slightly broader than the strict seven patterns.
            if (_isNoun(pot) || _isAdjective(pot)) {
                collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));
                if (i + 2 < tagged.length) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (_isNoun(pot)) {
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }
            } else if (pot.equals("IN") && i + 2 < tagged.length) {
                pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                if (_isNoun(pot)) {
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                }
            }
        }
    }
    return collocations;
}