
Elasticsearch bucket_selector aggregation and size: optimization


I have a question about the bucket_selector aggregation. (Environments tested: ES 6.8 and ES 7 Basic on CentOS 7)

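In my use case I need to delete documents when a selected property is duplicated. The index is not very large, about 2 million records. The query that finds these records looks like this: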

GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 1000
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
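I get the buckets back. To ease the query and the load, however, I run it with size: 1000 and issue the next query to fetch more duplicates until none are returned. The problem is that too few duplicates are found. I checked the query result by setting size: 2000000: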

GET index_id1/_search
{
  "size": 0,
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "sameIds": {
          "terms": {
            "script": {
              "lang": "painless",
              "source": "return doc['nestedObjects.id'].value"
            },
            "size": 2000000  <-- too big
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "totalCount": "byId._count"
                },
                "script": {
                  "source": "params.totalCount > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
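As far as I understand, it does the same thing, except that I need to make 2000 calls (size: 1000 each) to go through the whole index. Does the composite aggregation cache results, or why is it better? Maybe there is a better approach for this case.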

GET index_id1/_search
{
  "aggs": {
    "byNested": {
      "nested": {
        "path": "nestedObjects"
      },
      "aggs": {
        "compositeAgg": {
          "composite": {
            "after": {
              "termsAgg": "03f10a7d-0162-4409-8647-c643274d6727"
            },
            "size": 1000,
            "sources": [
              {
                "termsAgg": {
                  "terms": {
                    "script": {
                      "lang": "painless",
                      "source": "return doc['nestedObjects.id'].value"
                    }
                  }
                }
              }
            ]
          },
          "aggs": {
            "byId": {
              "reverse_nested": {}
            },
            "byId_bucket_filter": {
              "bucket_selector": {
                "script": {
                  "source": "params.totalCount > 1"
                },
                "buckets_path": {
                  "totalCount": "byId._count"
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
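
For reference, the paging loop implied by the composite query above could be sketched in Python roughly as follows. This is only a sketch under assumptions: `search` is a hypothetical callable that sends a request body to `index_id1/_search` and returns the parsed JSON response (any client could sit behind it), and the loop assumes the response includes an `after_key` for as long as more pages may exist. It deliberately does not stop on an empty page, because the bucket_selector can filter out every bucket of a page even though later pages still contain duplicates.

```python
def build_request(page_size, after_key=None):
    """Build the composite query from the post; `after` is set only when paging."""
    composite = {
        "size": page_size,
        "sources": [
            {"termsAgg": {"terms": {"script": {
                "lang": "painless",
                "source": "return doc['nestedObjects.id'].value",
            }}}}
        ],
    }
    if after_key is not None:
        composite["after"] = after_key
    return {
        "size": 0,
        "aggs": {"byNested": {
            "nested": {"path": "nestedObjects"},
            "aggs": {"compositeAgg": {
                "composite": composite,
                "aggs": {
                    "byId": {"reverse_nested": {}},
                    "byId_bucket_filter": {"bucket_selector": {
                        "buckets_path": {"totalCount": "byId._count"},
                        "script": {"source": "params.totalCount > 1"},
                    }},
                },
            }},
        }},
    }


def iter_duplicate_ids(search, page_size=1000):
    """Yield (id, parent_doc_count) for every bucket that survives the selector.

    `search` is a hypothetical callable: request body (dict) -> response (dict).
    """
    after_key = None
    while True:
        response = search(build_request(page_size, after_key))
        agg = response["aggregations"]["byNested"]["compositeAgg"]
        for bucket in agg["buckets"]:
            yield bucket["key"]["termsAgg"], bucket["byId"]["doc_count"]
        # Assumption: a missing after_key means there are no further pages.
        after_key = agg.get("after_key")
        if after_key is None:
            break
```

With a real client, `search` would be a thin wrapper around the client's search call for `index_id1`. Each page continues from the previous `after_key`, so the number of round trips is still about the number of distinct ids divided by `page_size`, but no single request has to hold all buckets at once.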