Java: Accessing the words around a positional match in Lucene


Given a term match in a document, what is the best way to access the words around that match? I've read this article, but the problem is that the Lucene API has changed completely since it was posted (2009). Can someone show me how to do this in a newer version of Lucene, such as Lucene 4.6.1?

Edit:

public class TermVectorFun {
  public static String[] DOCS = {
    "The quick red fox jumped over the lazy brown dogs.",
    "Mary had a little lamb whose fleece was white as snow.",
    "Moby Dick is a story of a whale and a man obsessed.",
    "The robber wore a black fleece jacket and a baseball cap.",
    "The English Springer Spaniel is the best of all dogs.",
    "The fleece was green and red",
    "History looks fondly upon the story of the golden fleece, but most people don't agree"
  };

  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    //Index some made up content
    IndexWriter writer = new IndexWriter(ramDir, config);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      //Store both position and offset information
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    //Get a searcher

    DirectoryReader dirReader = DirectoryReader.open(ramDir);
    IndexSearcher searcher = new IndexSearcher(dirReader);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    Spans spans = fleeceQ.getSpans(reader.leaves().get(0), null, new LinkedHashMap<Term, TermContext>());
    int window = 2;//get the words within two of the match
    while (spans.next()) {
      int start = spans.start() - window;
      int end = spans.end() + window;
      Map<Integer, String> entries = new TreeMap<Integer, String>();

      System.out.println("Doc: " + spans.doc() + " Start: " + start + " End: " + end);
      Fields fields = reader.getTermVectors(spans.doc());
      Terms terms = fields.terms("content");

      TermsEnum termsEnum = terms.iterator(null);
      BytesRef text;
      while((text = termsEnum.next()) != null) {        
        //could store the BytesRef here, but String is easier for this example;
        //utf8ToString() decodes correctly, unlike new String() with the platform charset
        String s = text.utf8ToString();
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          int i = 0;
          int position = -1;
          while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
            if (position >= start && position <= end) {
              entries.put(position, s);
            }
            i++;
          }
        }
      }
      System.out.println("Entries:" + entries);
    }
  }
}
I get it now. The postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum) have been removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (which wraps a byte[]) for each term within a single field, rather than a Term. Another is that when requesting a Docs/AndPositionsEnum, you can now explicitly specify the skipDocs (normally this is the deleted documents, but in general you can provide any Bits).
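The windowing logic at the heart of the loop above can be illustrated without Lucene at all: given each term's positions in a document, collect the ones that fall within the span window into a position-sorted map. This is a minimal plain-Java sketch of that step (the term/position data is made up for the example; in the real code it comes from the DocsAndPositionsEnum):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowSketch {
  // Collect terms whose position falls within [spanStart - window, spanEnd + window).
  // `positions` maps each term to the positions at which it occurs in the document.
  static Map<Integer, String> window(Map<String, List<Integer>> positions,
                                     int spanStart, int spanEnd, int window) {
    int start = spanStart - window;
    int end = spanEnd + window;
    // TreeMap keeps the surrounding words in document order, keyed by position
    Map<Integer, String> entries = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> e : positions.entrySet()) {
      for (int pos : e.getValue()) {
        if (pos >= start && pos <= end) {
          entries.put(pos, e.getKey());
        }
      }
    }
    return entries;
  }

  public static void main(String[] args) {
    // "the fleece was green and red" -> positions 0..5; the match "fleece"
    // spans [1, 2) as a Lucene span would report it
    Map<String, List<Integer>> positions = new TreeMap<>();
    positions.put("the", List.of(0));
    positions.put("fleece", List.of(1));
    positions.put("was", List.of(2));
    positions.put("green", List.of(3));
    positions.put("and", List.of(4));
    positions.put("red", List.of(5));
    System.out.println(window(positions, 1, 2, 2));
    // {0=the, 1=fleece, 2=was, 3=green, 4=and}
  }
}
```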

Use Highlighter.getBestFragment, which can give you the portion of the text containing the best match. Something like:

TopDocs docs = searcher.search(query, maxdocs);
Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);

Scorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(scorer);
highlighter.getBestFragment(myAnalyzer, fieldName, firstDoc.get(fieldName));

"Thanks, but I don't think I need the more advanced classes to do this."

"You certainly don't have to. You could do a linear search for your terms over the returned documents yourself if you like. But why not use a tool designed for exactly this purpose?"

"Yes, you are right. I tried your solution, and even the search text gets analyzed. With your approach I can still get the words around the match, thanks!"
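The linear-search alternative mentioned in the comments, scanning the stored text yourself rather than consulting term vectors, might look like the sketch below. It uses naive whitespace tokenization only (no analyzer, no punctuation handling), so it will not match Lucene's tokenization exactly; it is just meant to show the idea:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinearWindow {
  // Return the words within `window` tokens of each occurrence of `term`.
  static List<String> surrounding(String text, String term, int window) {
    String[] tokens = text.toLowerCase().split("\\s+");
    List<String> out = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      if (tokens[i].equals(term)) {
        // Clamp the window to the bounds of the token array
        int from = Math.max(0, i - window);
        int to = Math.min(tokens.length, i + window + 1);
        out.addAll(Arrays.asList(tokens).subList(from, to));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(surrounding("The fleece was green and red", "fleece", 2));
    // [the, fleece, was, green]
  }
}
```

For anything beyond a quick check, the term-vector approach above (or the Highlighter) is the better tool, since it respects the analyzer actually used at index time.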