When using the Elasticsearch Scroll API, how can the time parameter be optimized in situ?

optimization pagination

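I am using the Elasticsearch scroll API to return a large number of documents. According to the documentation,

"The scroll expiry time is refreshed every time we run a scroll request, so it only needs to be long enough to process the current batch of results, not all of the documents that match the query. The timeout is important because keeping the scroll window open consumes resources and we want to free them as soon as they are no longer needed. Setting the timeout enables Elasticsearch to automatically free the resources after a small period of inactivity."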

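My question is how to optimize the time parameter. I have had cases where I needed to request and process ~600 pages, and the scroll failed on page 300 (that is a long way in!). I suspect that if I could optimize the time parameter I pass, the scroll would use ES resources more efficiently and be less prone to failure. The code is being tested on one cluster but will probably be ported to many others, so I would like the optimization of the time parameter to adapt to the cluster. I also do not want to use more resources on the ES cluster than I need, because other users will presumably be using it too.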
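Independently of how the keep-alive is tuned, one way to avoid holding resources longer than needed is to release the scroll context explicitly once the last page has been read, instead of waiting for it to expire. The following is only a minimal sketch: it assumes a client object that offers `scroll` and `clear_scroll` the way the elasticsearch-py client does, and the helper name `iter_scroll_pages` and its parameters are hypothetical.

```python
def iter_scroll_pages(client, first_page, keepalive="1m"):
    """Yield each non-empty page of hits, then release the scroll context.

    first_page is the response of the initial search call made with a
    scroll argument. client is assumed to provide scroll() and
    clear_scroll() like the elasticsearch-py client.
    """
    sid = first_page["_scroll_id"]
    try:
        # The initial response may or may not carry hits of its own.
        hits = first_page["hits"]["hits"]
        if hits:
            yield hits
        while True:
            page = client.scroll(scroll_id=sid, scroll=keepalive)
            sid = page["_scroll_id"]  # the scroll id can change between pages
            hits = page["hits"]["hits"]
            if not hits:
                break  # an empty page means the scroll is exhausted
            yield hits
    finally:
        # Free the server-side context right away rather than on expiry.
        client.clear_scroll(scroll_id=sid)
```

Because the cleanup sits in a `finally` clause, the context is also released when the consumer stops early and closes the generator, or when a scroll request raises.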

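Here is my idea. On the initial scroll request, pass a generous time parameter, say `5m`, and time how long it takes to return the first page of results. On the second scroll request, pass a time parameter that is only slightly larger than the time observed for the first request. Inductively, each page request is given a time slightly larger than the observed completion time of the previous page. This assumes that, because each page returns the same number of documents (of nearly the same size, in my case), the time needed to return a page is roughly the same as for the pages observed before it. Does this assumption hold?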

Is there a smarter way to adapt the time parameter? And, for that matter, the size parameter (in the idea above, the size parameter stays fixed)?
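The "previous page time plus a little" rule can be sketched as a small helper that turns an observed duration into a keep-alive string. This is an illustration only: the function name `next_keepalive`, the margin, and the floor and ceiling values are all assumptions, not anything prescribed by Elasticsearch. Since the keep-alive has to cover everything that happens before the next scroll request, the observed duration should probably include client-side processing of the batch, not just the request itself.

```python
import math


def next_keepalive(observed_seconds, margin_seconds=10,
                   floor_seconds=5, ceiling_seconds=300):
    """Return a keep-alive such as '33s', slightly above the last page time.

    The margin, floor and ceiling are illustrative defaults (assumptions).
    """
    seconds = math.ceil(observed_seconds + margin_seconds)
    # Clamp so one very fast or very slow page cannot produce an extreme value.
    seconds = max(floor_seconds, min(ceiling_seconds, seconds))
    return "%ds" % seconds


print(next_keepalive(22.17))  # 33s
```

Rounding up to whole seconds keeps the value in a simple integer-plus-unit form, and the ceiling bounds how long an abandoned scroll can stay open.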

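OK, I did some data analysis and found a few things empirically. For many different sizes, I ran 10-20 pages of a scroll API query. For a fixed size, the time taken to return a page is roughly Gaussian-distributed, with the means given below: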

means =  {1000: 6.0284869194030763,
 1500: 7.9487858772277828,
 2000: 12.139444923400879,
 2500: 18.494202852249146,
 3000: 22.169868159294129,
 3500: 28.091009926795959,
 4000: 36.068559408187866,
 5000: 53.229292035102844}
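To see what these means imply for the size parameter, the per-page times can be converted into a cost per document. The sketch below only re-uses the numbers above; the time unit is whatever the measurements used (presumably seconds, which is an assumption).

```python
means = {1000: 6.0284869194030763,
         1500: 7.9487858772277828,
         2000: 12.139444923400879,
         2500: 18.494202852249146,
         3000: 22.169868159294129,
         3500: 28.091009926795959,
         4000: 36.068559408187866,
         5000: 53.229292035102844}

# Mean time per document for each page size.
per_doc = {size: t / size for size, t in means.items()}
best_size = min(per_doc, key=per_doc.get)

for size in sorted(per_doc):
    print("size=%d  per page=%.2f  per 1000 docs=%.2f"
          % (size, means[size], 1000 * per_doc[size]))
print("lowest per-document cost at size = %d" % best_size)
```

On these figures the per-document cost is lowest at a size of 1500 and generally rises for larger sizes, so bigger pages do not appear to be cheaper per document on this cluster. With only 10-20 pages per size, differences between neighbouring sizes may well be noise.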

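My next thought was that this might depend on whether other queries are running on the machine, so I ran an experiment in which half of the pages were fetched while mine was the only request to ES and the other half were fetched while a second scroll query was running. The timing did not seem to change.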

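Finally, since the time will depend on the given ES configuration, bandwidth, and so on, I propose this solution:

1. Set a generous time for the initial page.

2. Time each page.

3. Use a weighted running average between the observed time plus a little bit and the initial time (so the time parameter is always a little larger than needed, but decreases toward the mean plus a little bit). Here is an example: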

    import math
    import time

    # Assumes client (an Elasticsearch client), index, doc_type and the query
    # body q are defined elsewhere.
    tries = 0
    size = 3000
    wait_time = 120.0  ## generous starting keep-alive, in seconds

    while tries < 3:
        returned_hits = {}  ## page number -> list of hits (reset on every try)
        try:
            print("\n\t Running the alert scroll query with size = %s ..." % size)
            # search_type='scan' and doc_type are only accepted by older
            # Elasticsearch releases.
            page = client.search(index=index, doc_type=doc_type, body=q,
                                 scroll='1m', search_type='scan', size=size)

            sid = page['_scroll_id']  ## scroll id
            total_hits = page['hits']['total']  ## how many results there are
            print("\t\t There are %s hits total." % total_hits)

            p = 0  ## page count
            doc_count = 0  ## document count
            scroll_size = total_hits  ## positive, so the loop below is entered
            # Start scrolling
            while scroll_size > 0:
                p += 1
                print("\t\t Scrolling to page %s ..." % p)
                start = time.time()
                page = client.scroll(scroll_id=sid,
                                     scroll='%ds' % math.ceil(wait_time))
                elapsed = time.time() - start

                ## update wait_time (seconds) using a weighted running average
                wait_time = ((elapsed + 10) + wait_time * p) / (p + 1)
                print("\t\t Page %s took %s seconds. We change the time to %s"
                      % (p, elapsed, wait_time))

                sid = page['_scroll_id']  # Update the scroll ID
                scroll_size = len(page['hits']['hits'])  ## hits on this page

                print("\t\t Page %s has returned %s hits. Storing .." % (p, scroll_size))
                returned_hits[p] = page['hits']['hits']

                doc_count += scroll_size  ## update the total count of docs processed
                print("\t\t Returned and stored %s docs of %s \n" % (doc_count, total_hits))

            break  ## success, so exit the retry loop

        except Exception as e:
            print("\t\t ---- Error on try %s\n\t\t size was %s, wait_time was %s sec,"
                  "\n\t\t error message = %s" % (tries, size, wait_time, e))

            tries += 1  ## increment tries, and do it again until 3 tries
            # wait_time *= 2  ## double the time interval for the next go round
            size = int(.8 * size)  ## lower the number of docs returned per shard

            if tries == 3:
                print("\t\t three strikes and you're out! (failed three times in a "
                      "row to execute the alert query). Exiting.")
            else:
                print("\t\t ---- trying again for the %s-th time ..." % (tries + 1))
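To get a feel for how the running average in the loop above behaves, it can be run against made-up page times. The sketch below is a simulation, not a measurement: the helper name `update_wait_time`, the 120-second starting value and the constant 22-second page time are assumptions, and all values are taken to be in seconds.

```python
def update_wait_time(wait_time, elapsed, page_number, margin=10.0):
    """Weighted running average from the loop above; values in seconds."""
    return ((elapsed + margin) + wait_time * page_number) / (page_number + 1)


wait_time = 120.0  # generous initial value (assumed)
elapsed = 22.0     # pretend every page takes exactly this long
history = []
for p in range(1, 301):
    wait_time = update_wait_time(wait_time, elapsed, p)
    history.append(wait_time)

# With a constant page time the update has the closed form
#   wait_time_p = target + (initial - target) / (p + 1),
# where target = elapsed + margin.
print("after page 1:   %.2f" % history[0])
print("after page 300: %.2f" % history[-1])
```

With a constant page time the value stays above the target of `elapsed + margin` and approaches it like 1/(p+1). The same weighting means that page p contributes only 1/(p+1) of the new value, so a sudden slowdown late in a long scroll raises the keep-alive very slowly; if that turns out to matter, a fixed-weight (exponential) average or a maximum over recent pages might react faster.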