Getting total term frequencies for a subset of documents in Solr

solr lucene

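I'm interested in using Solr to analyze documents and get term frequencies across all documents that match certain criteria.

I tried the TermVectorComponent, but I can only get term frequencies for individual documents, not totals for a group of documents.

For example, given the following data: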

  [
    {
      "id": "1",
      "category": "cat1",
      "includes": "The green car."
    },
    {
      "id": "2",
      "category": "cat1",
      "includes": "The red car."
    },
    {
      "id": "3",
      "category": "cat2",
      "includes": "The black car."
    }
  ]

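I'd like to be able to get the total term frequency counts per category, e.g.:

  <category name="cat1">
     <lst name="the">2</lst>
     <lst name="car">2</lst>
     <lst name="green">1</lst>
     <lst name="red">1</lst>
  </category>
  <category name="cat2">
     <lst name="the">1</lst>
     <lst name="car">1</lst>
     <lst name="black">1</lst>
  </category>

I tried using facets, but I could not get them to combine the word counts of individual documents as shown above. I noticed that the term vector support provides document frequencies for terms across the entire index, but that is of no use to me. I only need the total frequency counts for a subset of documents.

Does anyone have any suggestions on how to get this information out of Solr/Lucene?

Thanks in advance.

I found this link; you would have to modify TermsComponent.java (possibly SolrJ?)

I have never tried it, but could you also use a function query (i.e. sum) to add up the tv.df values? Here is the full list of function queries.

It seems a solution has been provided (found via the related questions).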
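
As a point of comparison, the per-category totals the question asks for can be computed client-side by summing per-document term frequencies within each category. The sketch below is only an illustration of that aggregation, not a Solr feature: it tokenizes the sample `includes` text with a simple lowercase regex (an assumption standing in for whatever analyzer the field really uses), and the helper name `term_totals_by_category` is hypothetical. In practice the per-document counts could instead come from the `tv.tf` values that TermVectorComponent returns for the documents matching a filter.

```python
import re
from collections import Counter, defaultdict

# Sample documents from the question.
DOCS = [
    {"id": "1", "category": "cat1", "includes": "The green car."},
    {"id": "2", "category": "cat1", "includes": "The red car."},
    {"id": "3", "category": "cat2", "includes": "The black car."},
]


def term_totals_by_category(docs, group_field="category", text_field="includes"):
    """Sum term frequencies over all documents that share a group value."""
    totals = defaultdict(Counter)
    for doc in docs:
        # Assumption: lowercase word tokens approximate the field's analyzer.
        terms = re.findall(r"\w+", doc[text_field].lower())
        # Counter.update adds the counts of this document to the group total.
        totals[doc[group_field]].update(terms)
    return {group: dict(counts) for group, counts in totals.items()}


totals = term_totals_by_category(DOCS)
for category, counts in totals.items():
    print(category, counts)
```

This keeps the grouping logic outside Solr, so it only scales to result sets small enough to fetch; for large subsets a server-side approach would be preferable.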