.net: How to search ElasticSearch via NEST and filter results for distinct entries on a custom property

Tags: .net, elasticsearch, nest

I am using NEST in a .NET application that tracks locations and stores them in ElasticSearch. These TrackedLocation documents (see the simplified model below) all have a UserId, and each UserId will have many of these TrackedLocation documents indexed.

What I want to query for is all TrackedLocation documents near a given lat/lon and radius combination, but I only want the most recent one per user.. so basically a "distinct" filter on UserId, ordered by LocatedAtUtc.

I could of course fetch all the documents and post-process/filter them via Linq etc., but if NEST/ES can do this natively, I would much prefer that.

A variation of this query is to just count those distinct instances, as in: how many (distinct per user) are there at any given lat/lon/radius.

The model looks similar to this:
public class TrackedLocation
{
    public Guid Id { get; set; }
    public Guid UserId { get; set; }
    public MyLocation Location { get; set; }
    public DateTime LocatedAtUtc { get; set; }
}

public class MyLocation
{
    public double Lat { get; set; }
    public double Lon { get; set; }
}
.. the MyLocation type is only there for clarification.
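For reference, the client-side fallback mentioned above (fetch the documents, then post-process with Linq) could be sketched roughly as follows. This is only an illustration under assumptions: the `LatestPerUser` class, its method names, and the haversine distance helper are hypothetical and not part of NEST or the original application, and the input is assumed to be a list of documents that has already been fetched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class MyLocation
{
    public double Lat { get; set; }
    public double Lon { get; set; }
}

public class TrackedLocation
{
    public Guid Id { get; set; }
    public Guid UserId { get; set; }
    public MyLocation Location { get; set; }
    public DateTime LocatedAtUtc { get; set; }
}

// Hypothetical helper, for illustration only.
public static class LatestPerUser
{
    private const double EarthRadiusKm = 6371.0;

    // Great-circle distance between two points (haversine formula), in kilometers.
    public static double DistanceKm(MyLocation a, MyLocation b)
    {
        double ToRad(double deg) => deg * Math.PI / 180.0;
        double dLat = ToRad(b.Lat - a.Lat);
        double dLon = ToRad(b.Lon - a.Lon);
        double h = Math.Pow(Math.Sin(dLat / 2), 2)
                 + Math.Cos(ToRad(a.Lat)) * Math.Cos(ToRad(b.Lat)) * Math.Pow(Math.Sin(dLon / 2), 2);
        return 2 * EarthRadiusKm * Math.Asin(Math.Sqrt(h));
    }

    // Keeps only locations within radiusKm of center, then the most recent one per user.
    public static List<TrackedLocation> Select(
        IEnumerable<TrackedLocation> all, MyLocation center, double radiusKm)
    {
        return all
            .Where(t => DistanceKm(t.Location, center) <= radiusKm)
            .GroupBy(t => t.UserId)
            .Select(g => g.OrderByDescending(t => t.LocatedAtUtc).First())
            .ToList();
    }
}
```

The count variant is then simply the `Count` of the returned list. The drawback is the one already noted: every candidate document has to be transferred to the client first.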
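Is this possible via an ES/NEST query, and if so, how?

So, to answer my own question: after diving into ES's aggregations, I found the following solution (via NEST) to be the most practical and leanest version that does exactly what I asked for above: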
// Note: CreatedAtUtc is not part of the simplified model shown in the question.
var userIdsAggregationForLast24HoursAndLocation = elasticClient.Search<TrackedLocation>(search => search
    .Index(indexName)
    .MatchAll()
    .Source(false)
    .TrackScores(false)
    .Size(0)
    .Aggregations(aggregationDescriptor => aggregationDescriptor
        .Filter("trackedLocationsFromThePast24HoursAtGivenLocation", descriptor => descriptor
            .Filter(filterDescriptor => filterDescriptor
                .And(
                    combinedFilter => combinedFilter
                        .Range(dateRangeFilter => dateRangeFilter
                            .GreaterOrEquals(DateTime.UtcNow.Subtract(TimeSpan.FromDays(1))) // filter on instances created/indexed in the past 24 hours
                            .OnField(trackedLocation => trackedLocation.CreatedAtUtc)),
                    combinedFilter => combinedFilter // and the second filter here is the GeoDistance one.. 1km away from a given .Location(...,...)
                        .GeoDistance(trackedLocation => trackedLocation.Location, geoDistanceFilterDescriptor => geoDistanceFilterDescriptor
                            .Distance(1, GeoUnit.Kilometers)
                            .Location(37.809860, -122.476995)
                            .Optimize(GeoOptimizeBBox.Indexed))))
            .Aggregations(userIdAggregate => userIdAggregate.Terms("userIds", userIdTermsFilter => userIdTermsFilter
                .Field(trackedLocation => trackedLocation.UserId)
                .Size(100)))))); // get X distinct .UserIds
.. and the raw response from ES looks like this:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 100,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "trackedLocationsFromThePast24HoursAtGivenLocation": {
      "doc_count": 12,
      "userIds": {
        "buckets": [
          {
            "key": "0a50c2b4-17f0-41bc-b380-f8fca8ca117c",
            "doc_count": 5
          },
          {
            "key": "6b59efd8-a1f9-43c4-86a1-8560b908705f",
            "doc_count": 5
          },
          {
            "key": "667fb1c9-4c9c-4570-8bc1-f61d72e4385f",
            "doc_count": 1
          },
          {
            "key": "73e93ec8-622b-42e3-8a1c-96a0a2b3b2b2",
            "doc_count": 1
          }
        ]
      }
    }
  }
}
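As you can see, in this example there are 100 tracked locations in total, 12 of which were created and indexed in the past day by a total of 4 distinct users (IDs).. two of them created 5 each, and the other two created one location each.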
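That is exactly what I expected. I don't really care about the scores or the sources/documents themselves; as stated above, I only care about the tracked locations that fall within the filter, and from those I want the list of distinct user IDs.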
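The "count" variant from the question follows from the number of buckets in that response. A minimal sketch of reading the buckets out of the raw JSON is shown below, assuming .NET with `System.Text.Json` available; the `UserBuckets` class and its `Parse` method are hypothetical names for illustration. It deliberately works on the raw response rather than NEST's typed response so that it does not depend on any particular client version.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, for illustration only.
public static class UserBuckets
{
    // Reads aggregations.<filterName>.<termsName>.buckets from a raw search response
    // and returns a map of user ID -> number of matching tracked locations.
    public static Dictionary<string, long> Parse(string rawJson, string filterName, string termsName)
    {
        using JsonDocument doc = JsonDocument.Parse(rawJson);
        JsonElement buckets = doc.RootElement
            .GetProperty("aggregations")
            .GetProperty(filterName)
            .GetProperty(termsName)
            .GetProperty("buckets");

        var result = new Dictionary<string, long>();
        foreach (JsonElement bucket in buckets.EnumerateArray())
        {
            string key = bucket.GetProperty("key").GetString() ?? string.Empty;
            result[key] = bucket.GetProperty("doc_count").GetInt64();
        }
        return result;
    }
}
```

With the response above, `Parse(raw, "trackedLocationsFromThePast24HoursAtGivenLocation", "userIds").Count` gives the number of distinct users. Note that the terms aggregation was requested with `.Size(100)`, so this count is capped at 100 buckets.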
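Following the initial suggestion by @MartijnLaarman, and after reading more about the memory consumption of aggregations, I decided to give his suggested parent/child approach a try, and here is the result I was looking for.. not using aggregations, but merely filtering on the parent/child relationship.

The setup for the models now looks similar to this: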
elasticClient.CreateIndex(indexName, descriptor => descriptor
    .NumberOfReplicas(0)
    .NumberOfShards(1)
    .AddMapping<User>(new RootObjectMapping // I use TTL for testing/dev purposes to auto-cleanup after me
        {
            AllFieldMapping = new AllFieldMapping { Enabled = false },
            TtlFieldMappingDescriptor = new TtlFieldMapping { Enabled = true, Default = "5m" }
        },
        userDescriptor => userDescriptor.MapFromAttributes())
    .AddMapping<TrackedLocation>(new RootObjectMapping // I use TTL for testing/dev purposes to auto-cleanup after me
        {
            AllFieldMapping = new AllFieldMapping { Enabled = false },
            TtlFieldMappingDescriptor = new TtlFieldMapping { Enabled = true, Default = "5m" }
        },
        trackedLocationDescriptor => trackedLocationDescriptor
            .MapFromAttributes()
            .Properties(propertiesDescriptor => propertiesDescriptor
                .GeoPoint(geoPointMappingDescriptor => geoPointMappingDescriptor.Name(post => post.Location).IndexLatLon()))
            .SetParent<User>())); // < that's the essential part right here to allow the filtered query below
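The actual filtered query looks like this. The blocks below show, in order: the raw request of the earlier aggregation query (for comparison), the call that indexes a TrackedLocation under its parent User, the HasChild search via NEST, the raw request it produces, and an example raw response: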
{
  "size": 0,
  "track_scores": false,
  "_source": {
    "exclude": [
      "*"
    ]
  },
  "aggs": {
    "trackedLocationsFromThePast24HoursAtGivenLocation": {
      "filter": {
        "and": {
          "filters": [
            {
              "range": {
                "createdAtUtc": {
                  "gte": "2015-07-18T07:25:05.992"
                }
              }
            },
            {
              "geo_distance": {
                "location": "37.80986, -122.476995",
                "distance": 1.0,
                "unit": "km",
                "optimize_bbox": "indexed"
              }
            }
          ]
        }
      },
      "aggs": {
        "userIds": {
          "terms": {
            "field": "userId",
            "size": 100
          }
        }
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
// Index each TrackedLocation as a child of its parent User
elasticClient.Index(trackedLocation, descriptor => descriptor
    .Index(indexName)
    .Parent(parent.Id.ToString()));

// Search for User documents that have at least one matching TrackedLocation child
var userIdsFilteredQueryForLast24HoursAndLocation = elasticClient.Search<User>(search => search
    .Index(indexName)
    .MatchAll()
    .Source(false)
    .TrackScores(false)
    .Filter(outerFilter => outerFilter.HasChild<TrackedLocation>(childFilterDescriptor => childFilterDescriptor
        .Filter(filterDescriptor => filterDescriptor
            .And(
                andCombinedFilter1 => andCombinedFilter1
                    .Range(dateRangeFilter => dateRangeFilter
                        .GreaterOrEquals(DateTime.UtcNow.Subtract(TimeSpan.FromDays(1))) // filter on instances created/indexed in the past 24 hours
                        .OnField(trackedLocation => trackedLocation.CreatedAtUtc)),
                andCombinedFilter2 => andCombinedFilter2 // and the second filter here is the GeoDistance one.. 1km away from a given .Location(...,...)
                    .GeoDistance(trackedLocation => trackedLocation.Location, geoDistanceFilterDescriptor => geoDistanceFilterDescriptor
                        .Distance(1, GeoUnit.Kilometers)
                        .Location(37.809860, -122.476995)
                        .Optimize(GeoOptimizeBBox.Indexed)))))));
{
  "track_scores": false,
  "_source": {
    "exclude": [
      "*"
    ]
  },
  "query": {
    "match_all": {}
  },
  "filter": {
    "has_child": {
      "type": "trackedlocation",
      "filter": {
        "and": {
          "filters": [
            {
              "range": {
                "createdAtUtc": {
                  "gte": "2015-07-18T08:58:02.664"
                }
              }
            },
            {
              "geo_distance": {
                "location": "37.80986, -122.476995",
                "distance": 1.0,
                "unit": "km",
                "optimize_bbox": "indexed"
              }
            }
          ]
        }
      }
    }
  }
}
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "sampleindex",
        "_type": "user",
        "_id": "54ccbccd-eb2a-4a93-9be3-363b83cd3b21",
        "_score": 1.0,
        "_source": {}
      },
      {
        "_index": "locationtracking____sampleindex",
        "_type": "user",
        "_id": "42482b3b-d4c7-4a92-bf59-a4c25d707835",
        "_score": 1.0,
        "_source": {}
      }
    ]
  }
}
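The search itself runs against User instances, combined with a .HasChild filter.. whose conditions are again identical to the aggregation logic (by date and location).
As an example, the raw response is the final JSON block above (the one with two hits of type user).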
That response returns the (correct) set of User (ID) hits for users that have a TrackedLocation at the given location within the past day. Great.
I will go with this solution from now on rather than the aggregation one.. it comes at the "cost" of a parent/child relationship in ES, but overall it seems the more logical fit.
Comments:

Hi Jörg. Can you post a sample query that you have tried? I'm interested in how you are searching. The mapping matters too; if a parent > child relationship is set up in elasticsearch between user > tracked location, you should be able to get what you need entirely through NEST, which supports this.

@MartijnLaarman These TrackedLocation documents are not in a parent > child relationship with any User documents, nor do they need to be from the application's perspective (the application does not need to search for users).. I only want to filter down to the single "latest" one per .UserId (if any). What I am looking for sounds a lot like top_hits (); the website says it is not implemented yet, but it looks like it is ().