
Java StandardTokenizer Behavior


The following code runs under Lucene 3.0.1:

import java.io.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(
                true,
                new StandardTokenizer(Version.LUCENE_30, reader),
                StopAnalyzer.ENGLISH_STOP_WORDS_SET
        );
    }

    private static void printTokens(String string) throws IOException {
        TokenStream ts = new MyAnalyzer().tokenStream("default",
                new StringReader(string));
        TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.print(termAtt.term());
            System.out.print(" ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        printTokens("one_two_three");           // prints "one two three"
        printTokens("four4_five5_six6");        // prints "four4_five5_six6"
        printTokens("seven7_eight_nine");       // prints "seven7_eight nine"
        printTokens("ten_eleven11_twelve");     // prints "ten_eleven11_twelve"
    }
}
I can understand why one_two_three and four4_five5_six6 are tokenized the way they are, as explained in the linked answer. But the other two cases are more subtle, and I am not quite sure I get the idea.

Q1: If the appearance of 7 after seven makes it tokenized jointly with eight yet split from nine, why does ten stick to eleven11?

Q2: Is there any standard and/or simple way to make StandardTokenizer always split on underscores?

This is an interesting find. I'm not quite sure how to explain why it does that for Q1. For Q2, however, I can offer code that splits on the remaining underscores:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        StandardTokenizer tokenizer = new StandardTokenizer(
                Version.LUCENE_30, reader);
        TokenStream tokenStream = new StandardFilter(tokenizer);
        tokenStream = new MyTokenFilter(tokenStream);
        tokenStream = new StopFilter(true, tokenStream,
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return tokenStream;
    }
}

public class MyTokenFilter extends TokenFilter {
    private final TermAttribute termAttr;
    private String[] terms;  // parts of the current token, split on '_'
    private int pos;         // index of the next part to emit

    public MyTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
        this.termAttr = input.addAttribute(TermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (terms == null) {
            if (!input.incrementToken()) {
                return false;
            }
            terms = termAttr.term().split("_");
        }

        // Emit one part per call until the current token is exhausted.
        termAttr.setTermBuffer(terms[pos++]);
        if (pos == terms.length) {
            terms = null;
            pos = 0;
        }
        return true;
    }
}
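For Q2 there is also a simpler route worth sketching: rewrite underscores to spaces before tokenization, so StandardTokenizer never sees them. This is a minimal sketch assuming Lucene 3.0's MappingCharFilter and NormalizeCharMap APIs; the class name UnderscoreMappingAnalyzer is made up for illustration.

import java.io.Reader;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical alternative for Q2: map '_' to ' ' in the character stream
// before StandardTokenizer runs, so every underscore becomes a token boundary.
public class UnderscoreMappingAnalyzer extends Analyzer {
    private static final NormalizeCharMap NORM_MAP = new NormalizeCharMap();
    static {
        NORM_MAP.add("_", " ");
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        CharStream mapped = new MappingCharFilter(NORM_MAP, CharReader.get(reader));
        return new StopFilter(true,
                new StandardTokenizer(Version.LUCENE_30, mapped),
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }
}

Since "_" and " " have the same length, the char filter's offset correction is effectively a no-op, and the tokens keep their true positions in the original text.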

But done this way, I think the start and end offsets of the split terms come out wrong, don't they?
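They do: the filter above rewrites the term text but never touches the offsets, so every part reports the offsets of the whole original token. Below is a minimal sketch of an offset-aware variant; it assumes the emitted parts appear verbatim in the source text (an upstream filter such as StandardFilter can break that assumption), and the class name OffsetAwareSplitFilter is made up for illustration.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical offset-aware variant: each emitted part gets offsets
// computed from the start offset of the original, unsplit token.
public class OffsetAwareSplitFilter extends TokenFilter {
    private final TermAttribute termAttr;
    private final OffsetAttribute offsetAttr;
    private String[] terms;
    private int pos;
    private int tokenStart;  // start offset of the unsplit token
    private int partStart;   // offset of the next part within that token

    public OffsetAwareSplitFilter(TokenStream input) {
        super(input);
        this.termAttr = input.addAttribute(TermAttribute.class);
        this.offsetAttr = input.addAttribute(OffsetAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (terms == null) {
            if (!input.incrementToken()) {
                return false;
            }
            terms = termAttr.term().split("_");
            tokenStart = offsetAttr.startOffset();
            partStart = 0;
            pos = 0;
        }

        String part = terms[pos++];
        termAttr.setTermBuffer(part);
        // Each part starts where the previous one ended, plus one
        // character for the underscore that separated them.
        int start = tokenStart + partStart;
        offsetAttr.setOffset(start, start + part.length());
        partStart += part.length() + 1;

        if (pos == terms.length) {
            terms = null;
        }
        return true;
    }
}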