
Java: Using Lucene to count results in categories


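I am trying to use Lucene Java 2.3.2 to implement search over a catalog of products. Apart from the regular product fields, there is a field called "Category". A product can fall into multiple categories. Currently I use FilteredQuery to search for the same search term within every category, to get the number of results per category.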

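This results in 20-30 internal search calls per query just to display the results, which slows the search down considerably. Is there a faster way to achieve the same result using Lucene?

You may want to consider walking through all the documents that match each category using TermDocs. This sample code iterates over each "Category" term and then counts the number of documents that match that term: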

public static void countDocumentsInCategories(IndexReader reader) throws IOException {
    TermEnum terms = null;
    TermDocs td = null;


    try {
        terms = reader.terms(new Term("Category", ""));
        td = reader.termDocs();
        do {
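            Term currentTerm = terms.term();

            // term() returns null once the enumeration is exhausted (for example, an empty index)
            if (currentTerm == null || !currentTerm.field().equals("Category")) {
                break;
            }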

            int numDocs = 0;
            td.seek(terms);
            while (td.next()) {
                numDocs++;
            }

            System.out.println(currentTerm.field() + " : " + currentTerm.text() + " --> " + numDocs);
        } while (terms.next());
    } finally {
        if (td != null) td.close();
        if (terms != null) terms.close();
    }
}

This code should run reasonably fast even for large indexes.

Here is some code that tests the method:

public static void main(String[] args) throws Exception {
    RAMDirectory store = new RAMDirectory();

    IndexWriter w = new IndexWriter(store, new StandardAnalyzer());
    addDocument(w, 1, "Apple", "fruit", "computer");
    addDocument(w, 2, "Orange", "fruit", "colour");
    addDocument(w, 3, "Dell", "computer");
    addDocument(w, 4, "Cumquat", "fruit");
    w.close();

    IndexReader r = IndexReader.open(store);
    countDocumentsInCategories(r);
    r.close();
}

private static void addDocument(IndexWriter w, int id, String name, String... categories) throws IOException {
    Document d = new Document();
    d.add(new Field("ID", String.valueOf(id), Field.Store.YES, Field.Index.UN_TOKENIZED));
    d.add(new Field("Name", name, Field.Store.NO, Field.Index.UN_TOKENIZED));

    for (String category : categories) {
        d.add(new Field("Category", category, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }

    w.addDocument(d);
}

I don't have enough reputation to comment (!), but in Matt Quail's answer I'm pretty sure you could replace this:

int numDocs = 0;
td.seek(terms);
while (td.next()) {
    numDocs++;
}

with this:

int numDocs = terms.docFreq();

and then get rid of the td variable altogether. This should make it even faster.

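Let me see if I understand the question correctly: given a query from the user, you want to show how many matches there are for the query in each category. Correct?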

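Think of it like this: your query is actually

originalQuery AND (category1 OR category2 OR …)

except that, as well as an overall score, you want a number for each category. Unfortunately, the interface for collecting hits in Lucene is very narrow and only gives you an overall score for a query. But you could implement a custom Scorer/Collector.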

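Have a look at the source of org.apache.lucene.search.DisjunctionSumScorer. You could copy some of it to write a custom scorer that iterates through the category matches while your main search is going on. You could then keep a map to track the matches in each category.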
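The bookkeeping part of that idea (one pass over the hits, one counter per category) can be sketched without touching Lucene at all. The sketch below is only an illustration under stated assumptions: the class name `CategoryTallySketch`, the method `tally`, and the `categoriesByDoc` lookup are hypothetical stand-ins for whatever a custom scorer or collector would see, and it uses Java 8 collection methods rather than anything from Lucene 2.3.2.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Stand-alone sketch: tally hits per category in a single pass (no Lucene involved). */
class CategoryTallySketch {

    /**
     * hits            : doc ids matched by the user's query (hypothetical input)
     * categoriesByDoc : doc id -> categories of that document (hypothetical lookup)
     * Returns category -> number of hits in that category, sorted by category name.
     */
    static Map<String, Integer> tally(int[] hits, Map<Integer, List<String>> categoriesByDoc) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc : hits) {
            List<String> categories = categoriesByDoc.get(doc);
            if (categories == null) {
                continue; // document has no category information
            }
            for (String category : categories) {
                counts.merge(category, 1, Integer::sum); // start at 1 or add 1
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same toy catalog as the test code earlier on this page.
        Map<Integer, List<String>> categoriesByDoc = new HashMap<>();
        categoriesByDoc.put(1, Arrays.asList("fruit", "computer"));
        categoriesByDoc.put(2, Arrays.asList("fruit", "colour"));
        categoriesByDoc.put(3, Arrays.asList("computer"));
        categoriesByDoc.put(4, Arrays.asList("fruit"));

        // Pretend the query matched documents 1 and 3.
        System.out.println(tally(new int[] {1, 3}, categoriesByDoc));
        // prints {computer=2, fruit=1}
    }
}
```

In a real collector the inner loop would run inside the per-hit callback, so the search itself is still executed only once.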

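Here is what I did, though it is a bit heavy on memory:

What you need is to create in advance a set of BitSets, one per category, each containing the doc ids of all the documents in that category. Then, at search time, you use a HitCollector and check the doc ids against the BitSets.

Here is the code to create the bit sets: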

public BitSet[] getBitSets(IndexSearcher indexSearcher,
                           Category[] categories) throws IOException {
    BitSet[] bitSets = new BitSet[categories.length];
    for (int i = 0; i < categories.length; i++) {
        // Category is the poster's own type; getQuery() matches every document in it
        Query query = categories[i].getQuery();
        final BitSet bitSet = new BitSet();
        indexSearcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bitSet.set(doc); // remember each doc id that belongs to this category
            }
        });
        bitSets[i] = bitSet;
    }
    return bitSets;
}

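public int[] getCategroryCount(IndexSearcher indexSearcher,
                               Query query,
                               final BitSet[] bitSets) throws IOException {
    // Runs the user's query once; for every hit, bumps the counter of each
    // category whose BitSet contains that document.
    final int[] count = new int[bitSets.length];
    indexSearcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            for (int i = 0; i < bitSets.length; i++) {
                if (bitSets[i].get(doc)) count[i]++;
            }
        }
    });
    return count;
}

Sachin, I believe what you need does not come out of the box with Lucene. I suggest you try a tool that offers it as a major and convenient feature.

I did this, but it gives the count over all documents. In my case I want to count categories within the result set. For example, if the user searches for "apple", I want to show the number of matches found in the "Electronics" and "Fruit" categories. But your suggestion and Matt's count across all documents. I think I need to search against my searcher rather than the reader, but the searcher has no TermDocs.

This only counts the documents tagged with each term in the category field, which you can do faster with terms.docFreq(). What is missing is the intersection with the hits from the user's search criteria.

Another implementation for the getCategoryCount part: you can actually get a BitSet from the search (using a collector) and then intersect the resultsBitSet with whichever categoryBitSet you are interested in. The intersection should be faster than checking every document, and you can also intersect several categories with each other before intersecting with the results BitSet.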
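The intersection idea from the comments (collect the hits into one results BitSet, then intersect it with each category BitSet) can be sketched with plain `java.util.BitSet`. This is a minimal illustration, not the poster's code: the class name `CategoryIntersectionSketch` and the method `countPerCategory` are made up here, and the doc ids are invented. With Lucene 2.3.2 the results BitSet would presumably be filled inside a HitCollector, the same way getBitSets fills the category BitSets.

```java
import java.util.BitSet;

/** Stand-alone sketch: per-category hit counts via BitSet intersection. */
class CategoryIntersectionSketch {

    /**
     * For each category, returns how many documents are in both the result set
     * and that category. Neither input is modified.
     */
    static int[] countPerCategory(BitSet results, BitSet[] categories) {
        int[] counts = new int[categories.length];
        for (int i = 0; i < categories.length; i++) {
            BitSet overlap = (BitSet) results.clone(); // and() mutates, so work on a copy
            overlap.and(categories[i]);                // keep only docs also in the category
            counts[i] = overlap.cardinality();         // number of bits still set
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical doc ids: "fruit" holds docs 0, 1, 3 and "computer" holds docs 0, 2.
        BitSet fruit = new BitSet();
        fruit.set(0);
        fruit.set(1);
        fruit.set(3);
        BitSet computer = new BitSet();
        computer.set(0);
        computer.set(2);

        // Pretend the user's query matched documents 0 and 2.
        BitSet results = new BitSet();
        results.set(0);
        results.set(2);

        int[] counts = countPerCategory(results, new BitSet[] {fruit, computer});
        System.out.println("fruit=" + counts[0] + ", computer=" + counts[1]);
        // prints fruit=1, computer=2
    }
}
```

Compared with testing every category for every hit, this does one bulk `and` per category, at the cost of one temporary BitSet copy each time.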