Java StandardTokenizer behavior
The following code runs under Lucene 3.0.1:
import java.io.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(
                true,
                new StandardTokenizer(Version.LUCENE_30, reader),
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    private static void printTokens(String string) throws IOException {
        TokenStream ts = new MyAnalyzer().tokenStream("default",
                new StringReader(string));
        TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.print(termAtt.term());
            System.out.print(" ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        printTokens("one_two_three");       // prints "one two three"
        printTokens("four4_five5_six6");    // prints "four4_five5_six6"
        printTokens("seven7_eight_nine");   // prints "seven7_eight nine"
        printTokens("ten_eleven11_twelve"); // prints "ten_eleven11_twelve"
    }
}
I can understand why one_two_three and four4_five5_six6 are tokenized the way they are, as explained in the referenced answer. But the other two cases are more subtle, and I am not quite sure I understand the idea.

Q1: If the 7 that appears after seven causes it to be tokenized together with eight, while nine is split off, why does ten stick to eleven11?

Q2: Is there any standard and/or simple way to make StandardTokenizer always split on underscores?

That is an interesting find. I am not quite sure how to explain why it behaves that way for Q1. For Q2, however, I can offer code that splits on the remaining underscores:
public class MyAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        StandardTokenizer tokenizer = new StandardTokenizer(
                Version.LUCENE_30, reader);
        TokenStream tokenStream = new StandardFilter(tokenizer);
        tokenStream = new MyTokenFilter(tokenStream);
        tokenStream = new StopFilter(true, tokenStream,
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return tokenStream;
    }
}
public class MyTokenFilter extends TokenFilter {
    private final TermAttribute termAttr;
    private String[] terms;
    private int pos;

    public MyTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
        this.termAttr = input.addAttribute(TermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (terms == null) {
            // Pull the next token from upstream and split it on underscores.
            if (!input.incrementToken()) {
                return false;
            }
            terms = termAttr.term().split("_");
        }
        // Emit the pieces one at a time across successive calls.
        termAttr.setTermBuffer(terms[pos++]);
        if (pos == terms.length) {
            terms = null;
            pos = 0;
        }
        return true;
    }
}
But by doing it this way, I think the terms' start and end offsets come out wrong, don't they?
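Yes, the filter above leaves the original token's offsets on every piece. They can be recomputed, though: each piece's start offset is the original token's start offset plus the piece's character index inside the term, and its end offset is that start plus the piece's length. The sketch below (plain Java, deliberately independent of Lucene so the arithmetic can be checked in isolation; the `Piece` class and `split` helper are hypothetical names, not Lucene API) computes exactly the values a filter would feed to `OffsetAttribute.setOffset(start, end)`:

```java
import java.util.ArrayList;
import java.util.List;

public class SubTokenOffsets {
    // One underscore-separated piece with its corrected offsets.
    static final class Piece {
        final String term;
        final int start; // inclusive offset in the original character stream
        final int end;   // exclusive offset
        Piece(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Split a token on '_' and compute each piece's offsets from the
    // original token's start offset (as a Lucene OffsetAttribute reports it).
    static List<Piece> split(String token, int baseStart) {
        List<Piece> pieces = new ArrayList<>();
        int from = 0;
        while (from <= token.length()) {
            int sep = token.indexOf('_', from);
            int to = (sep == -1) ? token.length() : sep;
            if (to > from) { // skip empty pieces from leading/trailing '_'
                pieces.add(new Piece(token.substring(from, to),
                                     baseStart + from, baseStart + to));
            }
            if (sep == -1) {
                break;
            }
            from = sep + 1;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // "ten_eleven11_twelve" starting at offset 0 in the input
        for (Piece p : split("ten_eleven11_twelve", 0)) {
            System.out.println(p.term + " [" + p.start + "," + p.end + ")");
        }
        // prints:
        // ten [0,3)
        // eleven11 [4,12)
        // twelve [13,19)
    }
}
```

In a real `TokenFilter` one would capture the upstream token's start offset when splitting, keep the per-piece offsets alongside the `terms` array, and set both the term buffer and the offsets on each `incrementToken()` call.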