What's wrong with this Lucene token filter?
Disclaimer: I've been coding for 36 of the last 41 hours. I have a headache. I can't figure out why this combining token filter returns 2 tokens, both of which are the first token from the source stream.
public class TokenCombiner extends TokenFilter {
    /*
     * Recombines all tokens back into a single token using the specified delimiter.
     */
    public TokenCombiner(TokenStream in, int delimiter) {
        super(in);
        this.delimiter = delimiter;
    }

    int delimiter;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private boolean firstToken = true;
    int startOffset = 0;

    @Override
    public final boolean incrementToken() throws IOException {
        while (true) {
            boolean eos = input.incrementToken(); //We have to process tokens even if they return end of file.
            CharTermAttribute token = input.getAttribute(CharTermAttribute.class);
            if (eos && token.length() == 0) break; //Break early to avoid extra whitespace.
            if (firstToken) {
                startOffset = input.getAttribute(OffsetAttribute.class).startOffset();
                firstToken = false;
            } else {
                termAtt.append(Character.toString((char) delimiter));
            }
            termAtt.append(token);
            if (eos) break;
        }
        offsetAtt.setOffset(startOffset, input.getAttribute(OffsetAttribute.class).endOffset());
        return false;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        firstToken = true;
        startOffset = 0;
    }
}
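I think the fundamental problem here is that you have to realize that TokenCombiner and the producer it consumes (input) share and reuse the same attributes! So token == termAtt, always (try adding an assert!). Man, that sucks if you spent 36 hours coding on a weekend... Try this: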
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TokenCombiner extends TokenFilter {
    private final StringBuilder sb = new StringBuilder();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final char separator;
    private boolean consumed; // true if we already consumed

    protected TokenCombiner(TokenStream input, char separator) {
        super(input);
        this.separator = separator;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (consumed) {
            return false; // don't call input.incrementToken() after it returns false
        }
        consumed = true;

        int startOffset = 0;
        int endOffset = 0;
        boolean found = false; // true if we actually consumed any tokens

        while (input.incrementToken()) {
            if (!found) {
                startOffset = offsetAtt.startOffset();
                found = true;
            }
            sb.append(termAtt); // copy now: the producer overwrites termAtt on its next call
            sb.append(separator);
            endOffset = offsetAtt.endOffset();
        }

        if (found) {
            assert sb.length() > 0; // always: because we append separator
            sb.setLength(sb.length() - 1); // drop the trailing separator
            clearAttributes();
            termAtt.setEmpty().append(sb);
            offsetAtt.setOffset(startOffset, endOffset);
            return true;
        } else {
            return false;
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        sb.setLength(0);
        consumed = false;
    }
}
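To make the attribute-sharing point concrete without pulling in Lucene, here is a minimal plain-Java sketch. The `Term` and `Source` classes are hypothetical stand-ins (not Lucene types) for `CharTermAttribute` and the upstream producer; the only assumption is the one stated above, namely that producer and filter see the same mutable object. It shows why the combined text has to be copied into a private buffer while draining the input.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SharedAttributeDemo {
    /** Hypothetical stand-in for CharTermAttribute: one mutable buffer per chain. */
    static final class Term {
        final StringBuilder buf = new StringBuilder();
    }

    /** Hypothetical producer: overwrites the shared Term on every call. */
    static final class Source {
        private final Iterator<String> words;
        final Term term;

        Source(List<String> words, Term term) {
            this.words = words.iterator();
            this.term = term;
        }

        boolean incrementToken() {
            if (!words.hasNext()) {
                return false;
            }
            term.buf.setLength(0);          // the previous token is wiped here
            term.buf.append(words.next());
            return true;
        }
    }

    /** Joins all tokens, copying each one into a private buffer as it arrives. */
    static String combine(Source input, char separator) {
        Term termAtt = input.term;          // same instance the producer writes to
        StringBuilder sb = new StringBuilder();
        while (input.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(separator);
            }
            sb.append(termAtt.buf);         // copy before the next overwrite
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Term shared = new Term();
        Source source = new Source(Arrays.asList("quick", "brown", "fox"), shared);
        System.out.println(combine(source, ' '));   // the private copy has every token
        System.out.println(shared.buf);             // the shared buffer only has the last one
    }
}
```

After the loop the shared buffer holds only the last token, which is why accumulating directly in `termAtt` cannot work.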
Thanks! So should incrementToken() return true even when the current token is the last one available? (By the way, is there any documentation on designing TokenStreams? The javadocs don't include enough detail, such as what is covered below.)
Please scroll to the bottom of … and look at the workflow described in …; let me know what details we should add to the docs!
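On the return-value question, a small plain-Java model may help. `OneTokenStream` below is a hypothetical stand-in, not a Lucene class; it only mimics the contract the fixed filter follows: incrementToken() returns true for every token it produces, including the last one, and false only once nothing is left. In real Lucene the consumer additionally calls end() and close() after the loop, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementTokenContractDemo {
    /** Hypothetical stream that yields exactly one (combined) token per reset. */
    static final class OneTokenStream {
        private final String value;
        private boolean consumed;
        private String term;

        OneTokenStream(String value) {
            this.value = value;
        }

        boolean incrementToken() {
            if (consumed) {
                return false;   // exhausted: nothing new was produced
            }
            consumed = true;
            term = value;
            return true;        // true even though this is also the last token
        }

        String term() {
            return term;
        }

        void reset() {
            consumed = false;
            term = null;
        }
    }

    /** Consumer loop: reset first, then read until incrementToken() says false. */
    static List<String> drain(OneTokenStream stream) {
        List<String> seen = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            seen.add(stream.term());
        }
        return seen;
    }

    public static void main(String[] args) {
        OneTokenStream stream = new OneTokenStream("quick brown fox");
        System.out.println(drain(stream));
        System.out.println(stream.incrementToken());
    }
}
```

If the stream returned false while handing over its only token, as the filter in the question does, the `while` loop in `drain` would never see that token.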