
Elasticsearch bulk insert with Python - socket timeout error


ElasticSearch 7.10.2

Python 3.8.5

elasticsearch-py 7.12.1

I'm trying to bulk insert 100,000 records into Elasticsearch using the elasticsearch-py bulk helper.

Here is the Python code:

import sys
import datetime
import json
import os
import logging
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# ES Configuration start
es_hosts = [
    "http://localhost:9200",
]
es_api_user = 'user'
es_api_password = 'pw'
index_name = 'index1'
chunk_size = 10000
errors_before_interrupt = 5
refresh_index_after_insert = False
max_insert_retries = 3
yield_ok = False  # if set to False will skip successful documents in the output

# ES Configuration end
# =======================

filename = 'file.json'

logging.info('Importing data from {}'.format(filename))

es = Elasticsearch(
    es_hosts,
    #http_auth=(es_api_user, es_api_password),
    sniff_on_start=True,  # sniff before doing anything
    sniff_on_connection_fail=True,  # refresh nodes after a node fails to respond
    sniffer_timeout=60,  # and also every 60 seconds
    retry_on_timeout=True,  # should timeout trigger a retry on different node?
)


def data_generator():
    f = open(filename)
    for line in f:
        yield {**json.loads(line), **{
            "_index": index_name,
        }}


errors_count = 0

for ok, result in streaming_bulk(es, data_generator(), chunk_size=chunk_size, refresh=refresh_index_after_insert,
                                 max_retries=max_insert_retries, yield_ok=yield_ok):
    if ok is not True:
        logging.error('Failed to import data')
        logging.error(str(result))
        errors_count += 1

        if errors_count == errors_before_interrupt:
            logging.fatal('Too many import errors, exiting with error code')
            exit(1)

print("Documents loaded to Elasticsearch")
This code runs fine when the json file contains a small number of documents (~100). But I just tested it with a file containing 100k documents, and I got this error:

WARNING:elasticsearch:POST http://127.0.0.1:9200/_bulk?refresh=false [status:N/A request:10.010s]
Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 1347, in getresponse
    response.begin()
  File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Users/me/opt/anaconda3/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
    response = self.pool.urlopen(
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/util/retry.py", line 386, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=10)  
I have to admit this is a bit over my head. I don't usually like pasting large error messages here, but I'm not sure which part of this one is relevant.

I can't help thinking that I may need to tune some of the parameters on the es object, or some of the configuration variables. I don't know enough about these parameters to make an educated decision on my own.

Last but not least - it looks like some documents were loaded into the ES index anyway. But even stranger, the count shows 110k when the json file only has 100k.

TL;DR: Reduce the chunk_size from 10000 to the default of 500 and I would expect it to work. You probably also want to disable the automatic retries if those can create duplicates.

What happened?

You specified chunk_size=10000, which means the streaming_bulk call will try to insert chunks of 10000 elements. The connection to Elasticsearch has a configurable timeout, which defaults to 10 seconds (you can see read timeout=10 in the traceback). So, if your Elasticsearch server needs more than 10 seconds to process the 10000 elements you want to insert, a timeout happens and is handled as an error.
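To make the numbers concrete, a quick (hypothetical) way to check whether a chunk of that size can finish within the default limit is to time a single bulk request against a throwaway index; the index name timeout_test and the generous request_timeout used for the measurement are assumptions, not part of the question's code:

import time
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])

# 10000 trivial documents sent as one chunk to a throwaway index.
actions = ({"_index": "timeout_test", "value": i} for i in range(10000))

start = time.time()
# Use a generous timeout here so the measurement itself does not fail.
bulk(es, actions, chunk_size=10000, request_timeout=300)
print("one 10000-document chunk took %.1f s; the default client timeout is 10 s"
      % (time.time() - start))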

When creating the Elasticsearch object, you also set retry_on_timeout to True, and in the streaming_bulk call you set max_retries=max_insert_retries, which is 3.

This means that when such a timeout happens, the library will try reconnecting 3 times; however, when the insert still times out after that, it gives you the error you noticed.

Also, when the timeout happens, the library cannot know whether the documents were inserted successfully or not, so it has to assume they were not. It will therefore try to insert the same documents again. I don't know what your input lines look like, but if they do not contain an _id, this will create duplicates in your index. You probably want to prevent that -- either by adding some kind of _id, or by disabling the automatic retry and handling failures manually. A sketch of the first option follows below.
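As a sketch of the _id approach: the field name record_id below is a hypothetical placeholder for whatever uniquely identifies a line in your file. With a stable _id, a retried chunk overwrites the same documents instead of duplicating them.

def data_generator():
    # Assumes each JSON line contains a field that uniquely identifies the record;
    # "record_id" is a hypothetical name - use whatever unique field your data has.
    with open(filename) as f:
        for line in f:
            doc = json.loads(line)
            yield {
                **doc,
                "_index": index_name,
                "_id": doc["record_id"],  # stable id makes retries idempotent
            }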

What to do?

There are two ways you can go about this:

  • Increase the timeout
  • Reduce the chunk size

By default, chunk_size is set to 500. Your 10000 is much higher than that. I don't think you gain much performance by increasing it beyond 500, so my recommendation is to use the default of 500 here. If 500 still fails with timeouts, you may even want to reduce it further. This can happen if the documents you are indexing are very large or complex.
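If you follow the TL;DR, the only edits needed in the configuration block from the question are sketched below (the retry change is optional and only matters while your documents have no stable _id):

chunk_size = 500        # back to the streaming_bulk default
max_insert_retries = 0  # optional: disable automatic retries to avoid duplicate documents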

You can also increase the timeout for the streaming_bulk call, or, alternatively, for the es object. To change it only for the streaming_bulk call:

for ok, result in streaming_bulk(
        es,
        data_generator(),
        chunk_size=chunk_size,
        refresh=refresh_index_after_insert,
        request_timeout=60 * 3,  # 3 minutes
        yield_ok=yield_ok):
    # handle it like you did before
    pass

However, this also means that a failure of an elasticsearch node will only be detected after this higher timeout.
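For completeness, a sketch of the other variant mentioned above: raising the default timeout on the es object itself, so that every request made through this client (not only the bulk calls) gets the longer limit.

es = Elasticsearch(
    es_hosts,
    sniff_on_start=True,
    sniff_on_connection_fail=True,
    sniffer_timeout=60,
    retry_on_timeout=True,
    timeout=60 * 3,  # default request timeout in seconds, instead of the 10 s default
)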