
Elasticsearch: counting terms in a document


I am fairly new to Elasticsearch and am using version 6.5. My database contains website pages and their content, like this:

Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.
I have been able to run a simple query (using Python) that returns all documents whose content contains the word "cars".

The result looks like this:

{'took': 2521, 
'timed_out': False, 
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index': 
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571, 
'_source': {'content': '....'}}]}}
The "_id" refers to a domain, so I basically get back:

  • abc.com
  • def.com
  • jkl.com
But I now want to know how often the search term ("cars") occurs in each document, like:

  • abc.com:2
  • def.com:1
  • jkl.com:2
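I found several ways to get the number of documents that contain the search term, but none that tell me how to get the number of occurrences of the term within a document. I also couldn't find anything on this, although I'm fairly sure it is out there somewhere and I just don't recognize it as the solution to my problem.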

Update:

As suggested by @quitious_MInd, I tried the terms aggregation:

current_app.elasticsearch.search(
    index=index,
    doc_type=index,
    body={"aggs": {"cars_count": {"terms": {"field": "Content"}}}},
)
Result:

{'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful': 
5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0, 
'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252', 
'_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations': 
{'skala_count': {'doc_count_error_upper_bound': 0, 
'sum_other_doc_count': 0, 'buckets': []}}}

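I don't see where it would show the count per document, but I assume that's because "buckets" is empty? On another note: the results I get with the terms aggregation are significantly worse than those I get with a multi-match query. Is there any way to combine the two?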

I think you need a terms aggregation, as shown below.




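What you are trying to achieve can't be done in a single query. The first query filters the documents and returns the IDs of those for which term counts are required. Let's assume you have the following mapping: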

{
  "test": {
    "mappings": {
      "_doc": {
        "properties": {
          "details": {
            "type": "text",
            "store": true,
            "term_vector": "with_positions_offsets_payloads"
          },
          "name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Assume the query returns the following two documents:

{
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "details": "There is some content about cars here. Lots of cars!",
          "name": "n1"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "details": "This page is all about cars",
          "name": "n2"
        }
      }
    ]
  }
}
From the above response you can get all the document IDs that matched your query. For the above we have:
"_id": "1"
"_id": "2"
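Collecting those IDs from the search response can be sketched in Python as follows. This is only an illustration: the helper name `matching_ids` and the trimmed `example_response` dictionary are made up for the example and are not part of the Elasticsearch client.

```python
from typing import Any, Dict, List


def matching_ids(search_response: Dict[str, Any]) -> List[str]:
    """Return the _id of every hit in an Elasticsearch search response."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]


# Trimmed copy of the example search response (only the keys used here).
example_response = {
    "hits": {
        "total": 2,
        "max_score": 1,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1", "_score": 1},
            {"_index": "test", "_type": "_doc", "_id": "2", "_score": 1},
        ],
    }
}

print(matching_ids(example_response))  # ['1', '2']
```

A search returns only the first page of hits by default, so for larger result sets the IDs would have to be collected page by page.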

Now we use the _mtermvectors API to get the frequency (count) of each term in the given field:

test/_doc/_mtermvectors
{
  "docs": [
    {
      "_id": "1",
      "fields": [
        "details"
      ]
    },
    {
      "_id": "2",
      "fields": [
        "details"
      ]
    }
  ]
}
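The request body above can also be built from a list of IDs in Python. The following is a minimal sketch; `build_mtermvectors_body` is a hypothetical helper name, and the commented-out call assumes `es` is a connected `elasticsearch.Elasticsearch` client from the 6.x Python package.

```python
from typing import Any, Dict, Iterable, List


def build_mtermvectors_body(
    doc_ids: Iterable[str], field: str
) -> Dict[str, List[Dict[str, Any]]]:
    """Build an _mtermvectors body asking for one field of each document."""
    return {"docs": [{"_id": doc_id, "fields": [field]} for doc_id in doc_ids]}


body = build_mtermvectors_body(["1", "2"], "details")
print(body)

# Assumption: `es` is an elasticsearch.Elasticsearch instance (6.x client).
# response = es.mtermvectors(index="test", doc_type="_doc", body=body)
```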
The above returns the following result:

{
  "docs": [
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 8,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 2,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 28,
                  "end_offset": 32
                },
                {
                  "position": 9,
                  "start_offset": 47,
                  "end_offset": 51
                }
              ]
            },
            ....
          }
        }
      }
    },
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 2,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 23,
                  "end_offset": 27
                }
              ]
            },
            ....
          }
        }
      }
    }
  ]
}
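Note that I have used .... to stand for the other terms' data in the field, since the term vectors API returns term-related details for all terms.
You can extract the information about the required term from the above response; here I have shown it for cars, and the field you are interested in is term_freq.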
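Pulling term_freq for a single term out of that response could look like the sketch below. The function name `term_frequencies` and the trimmed `example_mtv_response` are illustrative only. The lookup uses the term as it was indexed, so with a lowercasing analyzer the search term has to be given in lowercase.

```python
from typing import Any, Dict


def term_frequencies(
    mtv_response: Dict[str, Any], field: str, term: str
) -> Dict[str, int]:
    """Map each document _id to the term_freq of `term` in `field`.

    Documents without term vectors for the field, or without the term,
    are reported with a count of 0.
    """
    counts: Dict[str, int] = {}
    for doc in mtv_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# Trimmed copy of the example _mtermvectors response (only the keys used here).
example_mtv_response = {
    "docs": [
        {"_id": "1", "found": True,
         "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "found": True,
         "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}

print(term_frequencies(example_mtv_response, "details", "cars"))
# {'1': 2, '2': 1}
```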

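Thank you very much for your help. I think my use case might be more complicated than I thought; I have updated my question accordingly.

I don't think there is such a thing, but this may help you. Term vectors are one possibility: you can apply a filter on word length, and they return the counts per document. On the client side you can then filter the dictionary for the exact term. This is not optimal, since you may get a lot of extra terms back.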