Lucene custom analyzer: TokenStream contract violation


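I am trying to create my own custom Analyzer and Tokenizer classes in Lucene. I mostly followed these instructions:

I updated them as needed (in newer versions of Lucene, the Reader is stored in "input").

But I get an exception:

TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

What could be the cause? My assumption is that calling reset()/close() is not my job at all and should be done by the analyzer.

Here is my custom analyzer class: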

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new MyTokenizer());
    }
}
And my custom tokenizer class:

public class MyTokenizer extends Tokenizer {

    protected CharTermAttribute charTermAttribute =
            addAttribute(CharTermAttribute.class);

    public MyTokenizer() {
        char[] buffer = new char[1024];
        int numChars;
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars =
                    this.input.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

        String StringToTokenize = stringBuilder.toString();
        Terms = Tokenize(StringToTokenize);
    }

    public boolean incrementToken() throws IOException {
        if (CurrentTerm >= Terms.length)
            return false;
        this.charTermAttribute.setEmpty();
        this.charTermAttribute.append(Terms[CurrentTerm]);
        CurrentTerm++;
        return true;
    }

    static String[] Tokenize(String StringToTokenize) {
        // Here I process the string and create an array of terms.
        // I tested this method and it works OK.
        // In case it's relevant: I parse the string into terms in the
        // constructor. Then in incrementToken I simply iterate over the
        // Terms array and emit them one at a time.
        return Processed;
    }

    public void reset() throws IOException {
        super.reset();
        Terms = null;
        CurrentTerm = 0;
    }

    String[] Terms;
    int CurrentTerm;
}

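When I traced the exception, I found that the problem is in input.read: it seems that input has nothing in it (or rather, it holds ILLEGAL_STATE_READER), and I don't understand why.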

You are reading from the input stream in your Tokenizer constructor, before the tokenizer has been reset.

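The problem here, I think, is that you are handling the input as a String instead of as a stream. The intent is for you to read from the stream efficiently in the incrementToken method, rather than loading the whole stream into a String and pre-processing one big list of tokens up front.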
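To illustrate the streaming idea on its own, here is a minimal Lucene-free sketch that pulls one whitespace-delimited token at a time from a java.io.Reader. The class StreamingWordReader and its methods are hypothetical names made up for this example, and splitting on whitespace is an assumption, since the question does not show its tokenization rules. In a real Tokenizer, the body of nextToken() would roughly correspond to incrementToken(), writing into the CharTermAttribute instead of returning a String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reads one whitespace-delimited token per call. */
final class StreamingWordReader {
    private final Reader input;

    StreamingWordReader(Reader input) {
        this.input = input;
    }

    /** Returns the next token, or null once the stream is exhausted. */
    String nextToken() throws IOException {
        StringBuilder term = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (term.length() > 0) {
                    return term.toString(); // token complete
                }
                // otherwise: skip leading or repeated whitespace
            } else {
                term.append((char) c);
            }
        }
        // End of stream: emit a trailing token if one is pending.
        return term.length() > 0 ? term.toString() : null;
    }

    /** Demo convenience only: drains the reader into a list. */
    static List<String> readAll(Reader input) {
        StreamingWordReader words = new StreamingWordReader(input);
        List<String> out = new ArrayList<>();
        try {
            String token;
            while ((token = words.nextToken()) != null) {
                out.add(token);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Only the current token is held in memory at any time, which is the point of the streaming approach. For file or network input, wrapping the Reader in a BufferedReader avoids one underlying read per character.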


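It is possible to go this route, though. Just move all of the logic that is currently in the constructor into your reset method instead (after the call to super.reset()).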
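A minimal sketch of that change, assuming lucene-core 5 or later on the classpath (no-argument Tokenizer constructor, createComponents(String)). The whitespace split in tokenize() is only a placeholder for the logic the question leaves out, and TokenizerDemo is a hypothetical helper added to show the consuming workflow that the exception message refers to.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// final: Lucene asserts that TokenStream implementations (or their
// incrementToken()) are final when JVM assertions are enabled.
final class MyTokenizer extends Tokenizer {
    private final CharTermAttribute charTermAttribute =
            addAttribute(CharTermAttribute.class);
    private String[] terms = new String[0];
    private int currentTerm = 0;

    // No constructor logic: `input` is not readable until reset() has run.

    @Override
    public void reset() throws IOException {
        super.reset(); // makes the real Reader available through `input`
        char[] buffer = new char[1024];
        StringBuilder text = new StringBuilder();
        int numChars;
        while ((numChars = input.read(buffer, 0, buffer.length)) != -1) {
            text.append(buffer, 0, numChars);
        }
        terms = tokenize(text.toString());
        currentTerm = 0;
    }

    @Override
    public boolean incrementToken() {
        clearAttributes(); // start every token from a clean attribute state
        if (currentTerm >= terms.length) {
            return false;
        }
        charTermAttribute.setEmpty().append(terms[currentTerm++]);
        return true;
    }

    // Placeholder (assumption): the original tokenization is not shown.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}

final class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new MyTokenizer());
    }
}

/** Hypothetical consumer: reset(), incrementToken() loop, end(), close(). */
final class TokenizerDemo {
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        try (Analyzer analyzer = new MyAnalyzer();
             TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

The tokenizer itself never calls reset() or close(); that stays with whoever consumes the stream (here TokenizerDemo, in practice usually the IndexWriter or a query parser). The sketch does not set offsets or position increments, which a production tokenizer would normally do.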

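Thanks a lot! It worked. I had been struggling with this for the past two days, and now the cause seems obvious. If I may ask, why is it so bad to process the whole content up front and then just iterate over it? I did it that way because most of my tokenization logic fits Java's handy tools, such as replace and regexes.

@miv - Simply because it is inefficient: it is slower and uses more memory.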