How do I use the NGram tokenizer in Lucene 5.0?


I want to generate character n-grams for a string. Below is the code I used for this with the Lucene 4.1 library:

    Reader reader = new StringReader(text);
    NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 3, 5); //catch contiguous sequence of 3, 4 and 5 characters

    CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);

    while (gramTokenizer.incrementToken()) {
        String token = charTermAttribute.toString();
        System.out.println(token);
    }
However, I want to do this with Lucene 5.0.0. The NGramTokenizer in Lucene 5.0.0 has changed considerably compared with previous versions.

Does anyone know how to generate n-grams with Lucene 5.0.0?

The following code:

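  StringReader stringReader = new StringReader("abcd");
  NGramTokenizer tokenizer = new NGramTokenizer(1, 2); // minGram = 1, maxGram = 2
  tokenizer.setReader(stringReader); // in 5.0 the reader is set here, not in the constructor
  tokenizer.reset(); // must be called before the first incrementToken()
  CharTermAttribute termAtt = tokenizer.getAttribute(CharTermAttribute.class);
  while (tokenizer.incrementToken()) {
    String token = termAtt.toString();
    System.out.println(token);
  }
  tokenizer.end();   // finish the stream
  tokenizer.close(); // release the underlying reader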
produces:

a
ab
b
bc
c
cd
d
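
The order of that output (grouped by start position, then by increasing length) can be reproduced without Lucene. The following is only a rough sketch in plain Java to make the ordering explicit, not Lucene's implementation: the class name `CharNGrams` and the method `ngrams` are hypothetical, and it works on Java `char`s rather than code points, so it assumes the text contains no surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: lists character n-grams
// ordered by start position, then by increasing length.
class CharNGrams {

    static List<String> ngrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minGram; n <= maxGram; n++) {
                int end = start + n;
                if (end > text.length()) {
                    break; // longer grams from this start would run past the end
                }
                grams.add(text.substring(start, end));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Same input and gram sizes as the Lucene snippet above.
        for (String gram : ngrams("abcd", 1, 2)) {
            System.out.println(gram);
        }
    }
}
```

With `"abcd"`, `minGram = 1` and `maxGram = 2`, this prints the same seven lines as shown above. For real indexing or analysis the Lucene tokenizer is still the better choice, since it also tracks offsets and handles code points.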

Thanks, it worked. Good to know about the tokenizer.setReader(stringReader) method, which hands the StringReader to the tokenizer.