Optimization: How do I optimize the time parameter on the fly when using the Elasticsearch Scroll API?

I am using the Elasticsearch scroll API to return a large number of documents. Reportedly,





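OK, I did some data analysis and found a few things empirically. For a number of different sizes, I ran scroll API queries of 10-20 pages each. For a fixed size, the time taken to return a page is roughly Gaussian-distributed, with the mean for each size shown below.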

means =  {1000: 6.0284869194030763,
 1500: 7.9487858772277828,
 2000: 12.139444923400879,
 2500: 18.494202852249146,
 3000: 22.169868159294129,
 3500: 28.091009926795959,
 4000: 36.068559408187866,
 5000: 53.229292035102844}
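
One way to read these numbers is to divide each mean by its size, which gives a rough time per returned document. The sketch below does only that arithmetic on the values above. The name `per_doc` is made up for illustration, and the unit is assumed to be seconds, which the measurements do not state explicitly.

```python
# Mean page times copied from the measurements above (size -> mean time).
means = {1000: 6.0284869194030763,
         1500: 7.9487858772277828,
         2000: 12.139444923400879,
         2500: 18.494202852249146,
         3000: 22.169868159294129,
         3500: 28.091009926795959,
         4000: 36.068559408187866,
         5000: 53.229292035102844}

# Time per returned document at each size (assumed: seconds per document).
per_doc = {size: t / size for size, t in means.items()}

for size in sorted(per_doc):
    print("size=%d  mean=%.2f  per-doc=%.4f" % (size, means[size], per_doc[size]))

# How much more expensive one document is at the largest size than at the smallest.
ratio = per_doc[5000] / per_doc[1000]
print("per-doc cost at 5000 relative to 1000: %.2f" % ratio)
```

On these numbers the per-document time is lowest at size 1500 and roughly 1.8 times higher at size 5000 than at size 1000, so larger pages appear to cost more than proportionally. That is only a reading of this one data set, not a general property of the scroll API.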


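  • Set a generous page time for the initial page
  • Time each page
  • Use a weighted running average between the observed time plus a little extra and the initial time (so your time parameter is always slightly larger than needed, but decreases towards the mean plus a little extra). Here is an example: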

    import sys
    import time

    ## Assumes client, index, doc_type, q and size are already defined.
    returned_hits = {}  ## page number -> list of hits
    wait_time = 60.0    ## generous initial page time in seconds (assumed; matches the '1m' below)
    tries = 0
    while tries < 3:
        try:
            print("\n\tRunning the alert scroll query with size = %s ..." % size)
            page = client.search(index=index, doc_type=doc_type, body=q,
                                 scroll='1m', search_type='scan', size=size)
            sid = page['_scroll_id']  ## scroll id
            total_hits = page['hits']['total']  ## how many results there are
            print("\t\t There are %s hits total." % total_hits)
            p = 0  ## page count
            doc_count = 0  ## document count
            scroll_size = total_hits  ## a scan search returns no hits on the first request
            # Start scrolling
            while scroll_size > 0:
                p += 1
                print("\t\t Scrolling to page %s ..." % p)
                start = time.time()
                ## elapsed time is measured in seconds, so pass the timeout in whole seconds
                page = client.scroll(scroll_id=sid, scroll=str(int(wait_time) + 1) + 's')
                end = time.time()
                ## update wait_time using a weighted running average
                wait_time = ((end - start + 10) + float(wait_time * p)) / (p + 1)
                print("\t\t Page %s took %s seconds. We change the time to %s" % (p, end - start, wait_time))
                sid = page['_scroll_id']  # update the scroll ID
                scroll_size = len(page['hits']['hits'])  ## no. of hits returned on this page
                print("\t\t Page %s has returned %s hits. Storing .." % (p, scroll_size))
                returned_hits[p] = page['hits']['hits']
                doc_count += scroll_size  ## update the total count of docs processed
                print("\t\t Returned and stored %s docs of %s \n" % (doc_count, total_hits))
            tries = 3  ## set tries to three so we exit the while loop!
        except Exception:
            e = sys.exc_info()[0]
            print("\t\t ---- Error on try %s\n\t\t size was %s, wait_time was %s s, \n\t\terror message = %s" % (tries, size, wait_time, e))
            tries += 1  ## increment tries, and do it again until 3 tries
            # wait_time *= 2  ## double the time interval for the next go round
            size = int(.8 * size)  ## lower size of docs per shard returned
            if tries == 3:
                print("\t\t three strikes and you're out! (failed three times in a row to execute the alert query). Exiting. ")
            else:
                print('\t\t ---- trying again for the %s-th time ...' % (tries + 1))
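
The running-average update in the loop above can be pulled out and tried on its own, without a cluster. The following is a minimal sketch: `update_wait_time` and `simulate` are hypothetical helper names, the 10-second padding is taken from the example, and the initial value of 60 and the constant page time of 6 seconds are made-up inputs chosen only to show the behaviour.

```python
def update_wait_time(wait_time, elapsed, page_number, padding=10.0):
    """Fold one padded page time into the running average.

    Applied page after page, this equals
    (initial wait_time + sum of padded page times) / (pages seen + 1),
    so the generous initial value is gradually averaged away.
    """
    return ((elapsed + padding) + wait_time * page_number) / (page_number + 1)


def simulate(initial_wait, page_times, padding=10.0):
    """Return the wait_time after each page for a list of observed page times."""
    wait_time = initial_wait
    history = []
    for page_number, elapsed in enumerate(page_times, start=1):
        wait_time = update_wait_time(wait_time, elapsed, page_number, padding)
        history.append(wait_time)
    return history


# Hypothetical run: generous initial value of 60, every page takes 6 seconds.
history = simulate(60.0, [6.0] * 100)
print("after page 1:   %.2f" % history[0])
print("after page 10:  %.2f" % history[9])
print("after page 100: %.2f" % history[-1])
```

With a constant page time the value starts halfway between the initial setting and the padded page time, then keeps falling towards page time plus padding (16 here) while always staying above it. That matches the intent of keeping the time parameter slightly larger than needed.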