Lucene custom analyzer - TokenStream contract violation


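I am trying to create my own custom Analyzer and Tokenizer classes in Lucene. I mostly followed these instructions:

I updated them as needed (in newer versions of Lucene, the Reader is stored in "input").

However, I get an exception:

TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

What could be the reason for this? I gather that calling reset/close is not my job at all, but should be done by the analyzer.

Here is my custom analyzer class: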

public class MyAnalyzer extends Analyzer {

    protected TokenStreamComponents createComponents(String FieldName) {
        // TODO Auto-generated method stub
        return new TokenStreamComponents(new MyTokenizer());
    }
}

And my custom tokenizer class:

public class MyTokenizer extends Tokenizer {

    protected CharTermAttribute charTermAttribute =
            addAttribute(CharTermAttribute.class);

    public MyTokenizer() {
        char[] buffer = new char[1024];
        int numChars;
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars =
                    this.input.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

        String StringToTokenize = stringBuilder.toString();
        Terms = Tokenize(StringToTokenize);
    }

    public boolean incrementToken() throws IOException {
        if (CurrentTerm >= Terms.length)
            return false;
        this.charTermAttribute.setEmpty();
        this.charTermAttribute.append(Terms[CurrentTerm]);
        CurrentTerm++;
        return true;
    }

    static String[] Tokenize(String StringToTokenize) {
        // Here I process the string and create an array of terms.
        // I tested this method and it works ok.
        // In case it's relevant, I parse the string into terms in the
        // constructor. Then in incrementToken I simply iterate over the
        // Terms array and submit them one at a time.
        return Processed;
    }

    public void reset() throws IOException {
        super.reset();
        Terms = null;
        CurrentTerm = 0;
    }

    String[] Terms;
    int CurrentTerm;
}

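When I trace the exception, I can see that the problem is in input.read. It seems that there is nothing inside input (or rather, it holds an ILLEGAL_STATE_READER). I don't understand it.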

You are reading from the input stream in your Tokenizer constructor, before it has been reset.

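The problem here, I think, is that you are handling the input as a String instead of as a stream. The intent is for you to read from the stream efficiently in the incrementToken method, rather than loading the whole stream into a String and pre-processing one big list of tokens up front.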
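To illustrate the streaming shape described above, here is a minimal sketch in plain Java with no Lucene dependency. StreamingTokenSketch, its term() accessor, and the tokenize helper are hypothetical names chosen for this example, and splitting on whitespace is only an assumed tokenization rule. Each call to incrementToken reads just far enough to produce the next token, so the whole input never has to be held in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StreamingTokenSketch {
    private final Reader input;
    private final StringBuilder term = new StringBuilder();

    StreamingTokenSketch(Reader input) {
        this.input = input;
    }

    /** Advances to the next whitespace-delimited token; returns false at end of input. */
    boolean incrementToken() throws IOException {
        term.setLength(0);
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (term.length() > 0) {
                    return true;          // a token just ended
                }
                // otherwise keep skipping leading whitespace
            } else {
                term.append((char) c);
            }
        }
        return term.length() > 0;         // final token, if the input did not end in whitespace
    }

    String term() {
        return term.toString();
    }

    /** Convenience helper for the example: collects every token of a string. */
    static List<String> tokenize(String text) throws IOException {
        StreamingTokenSketch ts = new StreamingTokenSketch(new StringReader(text));
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) {
            out.add(ts.term());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("read the stream lazily"));
    }
}
```

In a real Tokenizer subclass the same loop would read from the inherited input field and write into the CharTermAttribute instead of a StringBuilder.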

It is possible to go this route, though. Just move all of the logic currently in the constructor into your reset method instead (after the super.reset() call).
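To make both the failure and the fix concrete, here is a minimal sketch that models the lifecycle with a stand-in base class rather than Lucene itself. ModelTokenizer, EagerTokenizer, ResetTokenizer, and consume are hypothetical names, and the whitespace split is only a placeholder for the asker's Tokenize method. The one idea borrowed from the behavior described above is that input is a placeholder reader that throws until reset() swaps in the reader supplied through setReader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenizerLifecycleSketch {

    /** Stand-in placeholder reader: every read fails until reset() has run. */
    static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override
        public int read(char[] cbuf, int off, int len) {
            throw new IllegalStateException("contract violation: reset() was not called");
        }

        @Override
        public void close() { }
    };

    /** Hypothetical, heavily simplified model of the tokenizer base class. */
    abstract static class ModelTokenizer {
        protected Reader input = ILLEGAL_STATE_READER;   // unusable until reset()
        private Reader inputPending = ILLEGAL_STATE_READER;
        protected final StringBuilder term = new StringBuilder();

        final void setReader(Reader reader) {
            inputPending = reader;
        }

        void reset() throws IOException {
            input = inputPending;                        // the real text becomes visible only here
            inputPending = ILLEGAL_STATE_READER;
        }

        void close() throws IOException {
            input.close();
            input = inputPending = ILLEGAL_STATE_READER;
        }

        abstract boolean incrementToken() throws IOException;
    }

    /** Mirrors the question: reads input in the constructor, before reset(). */
    static class EagerTokenizer extends ModelTokenizer {
        EagerTokenizer() throws IOException {
            input.read(new char[16], 0, 16);             // throws IllegalStateException
        }

        @Override
        boolean incrementToken() {
            return false;
        }
    }

    /** Mirrors the answer: the same work moved into reset(), after super.reset(). */
    static class ResetTokenizer extends ModelTokenizer {
        private String[] terms = new String[0];
        private int currentTerm;

        @Override
        void reset() throws IOException {
            super.reset();
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[1024];
            int numChars;
            while ((numChars = input.read(buffer, 0, buffer.length)) != -1) {
                sb.append(buffer, 0, numChars);
            }
            String text = sb.toString().trim();
            // Placeholder for the asker's Tokenize(...) method.
            terms = text.isEmpty() ? new String[0] : text.split("\\s+");
            currentTerm = 0;
        }

        @Override
        boolean incrementToken() {
            if (currentTerm >= terms.length) {
                return false;
            }
            term.setLength(0);
            term.append(terms[currentTerm++]);
            return true;
        }
    }

    /** Consumer workflow in this model: setReader, reset, incrementToken loop, close. */
    static List<String> consume(ModelTokenizer tokenizer, String text) throws IOException {
        tokenizer.setReader(new StringReader(text));
        tokenizer.reset();
        List<String> out = new ArrayList<>();
        while (tokenizer.incrementToken()) {
            out.add(tokenizer.term.toString());
        }
        tokenizer.close();
        return out;
    }

    public static void main(String[] args) throws IOException {
        try {
            new EagerTokenizer();
        } catch (IllegalStateException e) {
            System.out.println("constructor read failed: " + e.getMessage());
        }
        System.out.println(consume(new ResetTokenizer(), "hello token stream"));
    }
}
```

Note that the reset method here rebuilds the term array instead of setting it to null, which is what moving the constructor logic amounts to. In real Lucene code the consumer also calls end() before close(), and incrementToken should start with clearAttributes().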

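Thank you very much! It worked. I have been struggling with this for the past two days, and now the reason seems obvious. If I may ask, why is it so bad to process the entire content and then just iterate over it? I did it that way because much of my tokenization logic lends itself to Java's handy tools, such as replace and regexes.

@miv - Just because it is inefficient: it is slower and uses more memory.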