Printing a Lucene index in inverted-index format


As far as I understand, Lucene uses an inverted index. Is there a way to extract/print a Lucene index (Lucene 6) in its inverted-index format:

term1   <doc1, doc100, ..., doc555>
term2   <doc1, ..., doc100, ..., doc89>
term3   <doc3, doc2, doc5, ...>
.
.
.
termn   <doc10, doc43, ..., dock>
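In other words, the target structure is simply a map from each term to the ordered set of documents containing it. As a toy illustration outside Lucene (the whitespace "tokenizer" and class name are hypothetical, for shape only):

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyInvertedIndex {
    // Build term -> sorted set of doc ids: the shape sketched above.
    static Map<String, SortedSet<Integer>> build(String[] docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            // Naive tokenization: lowercase, split on whitespace
            for (String term : docs[docId].toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] docs = {
            "We hold that proof is required",
            "Proof requires a preponderance of the evidence"
        };
        // Prints each term followed by the doc ids it occurs in
        build(docs).forEach((term, ids) -> System.out.println(term + "\t" + ids));
    }
}
```

The question is how to get this same view out of a real Lucene index, where the postings are stored in Lucene's own segment files.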

I'm using Lucene 6.x.x myself. I'm not sure there is any easy way, but a solution is better than none. An approach like the following, using MatchAllDocsQuery, worked for me:

// Uses org.apache.lucene.document.Document, org.apache.lucene.index.{IndexableField, Terms, TermsEnum},
// and org.apache.lucene.search.{IndexSearcher, MatchAllDocsQuery, ScoreDoc, TopDocs}
private static void printWholeIndex(IndexSearcher searcher) throws IOException {
    // Match every document so we can visit each one's term vectors
    MatchAllDocsQuery query = new MatchAllDocsQuery();
    TopDocs hits = searcher.search(query, Integer.MAX_VALUE);

    Map<String, Set<Integer>> invertedIndex = new HashMap<>();

    if (hits.scoreDocs == null || hits.scoreDocs.length == 0) {
        System.out.println("No hits found with MatchAllDocsQuery");
        return;
    }

    for (ScoreDoc hit : hits.scoreDocs) {
        Document doc = searcher.doc(hit.doc);

        for (IndexableField field : doc.getFields()) {
            // Per-document "inverted index": the term vector for this field
            Terms terms = searcher.getIndexReader().getTermVector(hit.doc, field.name());
            if (terms == null) {
                continue;  // no term vector stored for this field
            }
            TermsEnum termsEnum = terms.iterator();
            while (termsEnum.next() != null) {
                String term = termsEnum.term().utf8ToString();
                // TreeSet keeps the doc ids for each term sorted
                invertedIndex.computeIfAbsent(term, k -> new TreeSet<>()).add(hit.doc);
            }
        }
    }

    System.out.println("Printing inverted index:");
    invertedIndex.forEach((term, docs) -> System.out.println(term + ":" + docs));
}
Two points:

1. Maximum number of documents supported: Integer.MAX_VALUE. I haven't tried it, but using the searcher's searchAfter method and performing multiple searches could probably remove this limitation.

2. doc.getFields() returns only stored fields. If none of your indexed fields are stored, you could keep a static array of field names instead; Terms terms = searcher.getIndexReader().getTermVector(hit.doc, field.name()) works for non-stored fields as well.
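A hedged sketch of the searchAfter idea from point 1 (untested; the method name and the pageSize value are my own choices, not part of the original answer):

```java
// Page through all documents instead of one search(query, Integer.MAX_VALUE) call.
static void visitAllDocs(IndexSearcher searcher) throws IOException {
    final int pageSize = 1000;  // arbitrary batch size; tune as needed
    MatchAllDocsQuery query = new MatchAllDocsQuery();
    ScoreDoc last = null;
    while (true) {
        // searchAfter resumes the result list just past the last hit of the previous page
        TopDocs page = (last == null)
                ? searcher.search(query, pageSize)
                : searcher.searchAfter(last, query, pageSize);
        if (page.scoreDocs.length == 0) {
            break;  // no more documents
        }
        for (ScoreDoc hit : page.scoreDocs) {
            // process hit.doc exactly as printWholeIndex does above
        }
        last = page.scoreDocs[page.scoreDocs.length - 1];
    }
}
```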

You can use a TermsEnum to iterate over the terms of the inverted index, and then, for each term, use its PostingsEnum to iterate over the postings. If you have an index with a single segment (Lucene version 6.5.1), the following works:
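A minimal sketch of that iteration (hedged: the field name "text" and method name are assumptions, and this walks the on-disk inverted index directly, so no term vectors are required):

```java
// Print term -> postings by walking the index segment by segment.
// With a single-segment index, reader.leaves() contains exactly one entry.
static void dumpInvertedIndex(IndexReader reader) throws IOException {
    for (LeafReaderContext leafCtx : reader.leaves()) {   // one leaf per segment
        LeafReader leaf = leafCtx.reader();
        Terms terms = leaf.terms("text");                 // assumed field name
        if (terms == null) {
            continue;  // field absent in this segment
        }
        TermsEnum termsEnum = terms.iterator();
        PostingsEnum postings = null;                     // reused across terms
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            StringBuilder line = new StringBuilder(term.utf8ToString()).append("\t<");
            postings = termsEnum.postings(postings, PostingsEnum.NONE);
            int docId;
            while ((docId = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                // docBase converts the segment-local id to an index-wide id
                line.append("doc").append(leafCtx.docBase + docId).append(' ');
            }
            System.out.println(line.append('>'));
        }
    }
}
```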


If the index has multiple segments, reader.leaves() will return a reader tree whose leaves are other readers (picture a tree of index readers). In that case you should traverse down to the leaves and repeat the per-segment code for each leaf in a for loop.

Here is a version developed for Lucene 6.6 that prints docId:tokenPos:

// Uses org.apache.lucene.{analysis.standard, document, index, search, store, util}
Directory directory = new RAMDirectory();
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(directory, iwc);

// Term vectors (with positions and offsets) must be stored at index time
FieldType type = new FieldType();
type.setStoreTermVectors(true);
type.setStoreTermVectorPositions(true);
type.setStoreTermVectorOffsets(true);
type.setIndexOptions(IndexOptions.DOCS);

Field fieldStore = new Field("text", "We hold that proof beyond a reasonable doubt is required.", type);
Document doc = new Document();
doc.add(fieldStore);
writer.addDocument(doc);

fieldStore = new Field("text", "We hold that proof requires a reasonable preponderance of the evidence.", type);
doc = new Document();
doc.add(fieldStore);
writer.addDocument(doc);

writer.close();

DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

MatchAllDocsQuery query = new MatchAllDocsQuery();
TopDocs hits = searcher.search(query, Integer.MAX_VALUE);

// term -> set of "docId:tokenPos" strings (doc ids printed 1-based)
Map<String, Set<String>> invertedIndex = new HashMap<>();
BiFunction<Integer, Integer, Set<String>> mergeValue =
    (docId, pos) -> { TreeSet<String> s = new TreeSet<>(); s.add((docId + 1) + ":" + pos); return s; };

for (ScoreDoc scoreDoc : hits.scoreDocs) {
    Fields termVs = reader.getTermVectors(scoreDoc.doc);
    Terms terms = termVs.terms("text");
    TermsEnum termsIt = terms.iterator();
    PostingsEnum docsAndPosEnum = null;
    BytesRef bytesRef;
    while ((bytesRef = termsIt.next()) != null) {
        docsAndPosEnum = termsIt.postings(docsAndPosEnum, PostingsEnum.ALL);
        docsAndPosEnum.nextDoc();  // a term vector holds a single (local) document
        String term = bytesRef.utf8ToString();
        // A term may occur at several positions in the same document,
        // so iterate freq() times rather than reading one position
        int freq = docsAndPosEnum.freq();
        for (int i = 0; i < freq; i++) {
            int pos = docsAndPosEnum.nextPosition();
            invertedIndex.merge(
                term,
                mergeValue.apply(scoreDoc.doc, pos),
                (s1, s2) -> { s1.addAll(s2); return s1; }
            );
        }
    }
}
System.out.println(invertedIndex);

Note that this solution is inefficient; for example, it took a very long time on an index of three days' worth of tweets. (Comment: "three days of tweets" is not meaningful on its own; please state the number of documents. Reply: fair point. I admit performance is an angle I had not considered, and I will look into it as well. Once the logic is correct for a small set of documents, you can then think about making it scale.)