Lucene custom analyzer: TokenStream contract violation
I'm trying to create my own custom analyzer and tokenizer classes in Lucene. I mostly followed an existing set of instructions and updated them as needed (in newer versions of Lucene, the Reader is stored in "input").

But I get an exception:

TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

What might be the reason for this? As I understand it, calling reset()/close() is not my job at all and should be done by the analyzer.

This is my custom analyzer class:
import org.apache.lucene.analysis.Analyzer;

public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new MyTokenizer());
    }
}
And this is my custom tokenizer class:
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MyTokenizer extends Tokenizer {
    protected CharTermAttribute charTermAttribute =
            addAttribute(CharTermAttribute.class);

    public MyTokenizer() {
        char[] buffer = new char[1024];
        int numChars;
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars =
                    this.input.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        String StringToTokenize = stringBuilder.toString();
        Terms = Tokenize(StringToTokenize);
    }

    public boolean incrementToken() throws IOException {
        if (CurrentTerm >= Terms.length)
            return false;
        this.charTermAttribute.setEmpty();
        this.charTermAttribute.append(Terms[CurrentTerm]);
        CurrentTerm++;
        return true;
    }

    static String[] Tokenize(String StringToTokenize) {
        // Here I process the string and create an array of terms.
        // I tested this method and it works ok.
        // In case it's relevant, I parse the string into terms in the
        // constructor. Then in incrementToken I simply iterate over the
        // Terms array and submit them one at a time.
        return Processed;
    }

    public void reset() throws IOException {
        super.reset();
        Terms = null;
        CurrentTerm = 0;
    }

    String[] Terms;
    int CurrentTerm;
}
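When I trace the exception, I can see that the problem is at input.read: there seems to be nothing inside input (or rather, it holds an ILLEGAL_STATE_READER), and I don't understand why.

You are reading from the input stream in your Tokenizer constructor, before it has been reset. Until reset() is called, input is only a placeholder reader that throws this exception on any read; the real Reader is swapped in by super.reset().

The problem here, I think, is that you are handling the input as a String rather than as a stream. The intent is for you to read from the stream efficiently in the incrementToken method, rather than loading the whole stream into a String and pre-processing one big list of tokens up front.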
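To make the streaming idea concrete, here is a minimal sketch in plain java.io, not Lucene API. The class name StreamingWordReader and its next() method are hypothetical, and splitting on whitespace is only an assumption standing in for the real tokenization rules. It pulls characters from the Reader on demand and hands back one token per call, so only the current token is held in memory. In a Lucene Tokenizer, a loop of this shape would live in incrementToken and read from input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): one whitespace-delimited token per call. */
public class StreamingWordReader {
    private final Reader input;
    private final StringBuilder term = new StringBuilder();

    public StreamingWordReader(Reader input) {
        this.input = input;
    }

    /** Returns the next token, or null once the input is exhausted. */
    public String next() {
        term.setLength(0);
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (term.length() > 0) {
                        return term.toString(); // whitespace ends the current token
                    }
                    // otherwise this is leading whitespace: keep skipping
                } else {
                    term.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // End of input: emit a trailing token if one was being built.
        return term.length() > 0 ? term.toString() : null;
    }

    public static void main(String[] args) {
        StreamingWordReader words =
                new StreamingWordReader(new StringReader("  quick brown\tfox "));
        List<String> seen = new ArrayList<>();
        for (String t = words.next(); t != null; t = words.next()) {
            seen.add(t);
        }
        System.out.println(seen); // prints [quick, brown, fox]
    }
}
```

The trade-off is that logic which needs the whole text at once, such as a regex over the full input, does not fit this shape directly.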
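That said, the pre-tokenizing route can still be made to work. Just move all the logic that is currently in the constructor into your reset method (after the call to super.reset()).

Thanks a lot! It worked. I've been struggling with this for the past two days, and now the reason seems obvious. If I may ask, why is it so bad to process the entire content first and then just iterate over it? I did it that way because most of my tokenization logic fits naturally with handy Java tools such as replace and regexes.

@miv - Simply because it's inefficient: it is slower and uses more memory.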
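The fix described in the answer, moving the constructor logic into reset() after super.reset(), might look roughly like the sketch below. It assumes Lucene 5 or later (a no-argument Tokenizer constructor and the protected input field) with the Lucene core and analysis jars on the classpath. The whitespace-splitting tokenize body is only a placeholder for the asker's real logic, which the question does not show. The class is declared final because TokenStream checks, when assertions are enabled, that either the class or incrementToken() is final.

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class MyTokenizer extends Tokenizer {
    private final CharTermAttribute charTermAttribute =
            addAttribute(CharTermAttribute.class);
    private String[] terms = new String[0];
    private int currentTerm = 0;

    // Nothing is read here: 'input' is unusable until setReader() and reset() have run.
    public MyTokenizer() {
    }

    @Override
    public void reset() throws IOException {
        super.reset(); // after this call, 'input' is the real Reader
        char[] buffer = new char[1024];
        StringBuilder text = new StringBuilder();
        int numChars;
        while ((numChars = input.read(buffer, 0, buffer.length)) != -1) {
            text.append(buffer, 0, numChars);
        }
        terms = tokenize(text.toString());
        currentTerm = 0;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (currentTerm >= terms.length) {
            return false;
        }
        clearAttributes(); // start every token from a clean attribute state
        charTermAttribute.append(terms[currentTerm]);
        currentTerm++;
        return true;
    }

    // Placeholder (assumption): splits on whitespace in place of the real logic.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Consumer workflow from the TokenStream Javadocs:
    // setReader, reset, incrementToken until false, end, close.
    public static void main(String[] args) throws IOException {
        MyTokenizer tokenizer = new MyTokenizer();
        tokenizer.setReader(new StringReader("hello lucene"));
        CharTermAttribute term = tokenizer.getAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```

Note that reset() must not set the term array back to null after filling it, as the original reset() did, or incrementToken() would fail on the first call.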