Elasticsearch bulk insert w/ Python - socket timeout error
ElasticSearch 7.10.2, Python 3.8.5, elasticsearch-py 7.12.1

I'm trying to bulk insert 100,000 records into ElasticSearch using the elasticsearch-py bulk helper. Here is the Python code:
import sys
import datetime
import json
import os
import logging
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk
# ES Configuration start
es_hosts = [
    "http://localhost:9200",
]
es_api_user = 'user'
es_api_password = 'pw'
index_name = 'index1'
chunk_size = 10000
errors_before_interrupt = 5
refresh_index_after_insert = False
max_insert_retries = 3
yield_ok = False # if set to False will skip successful documents in the output
# ES Configuration end
# =======================
filename = 'file.json'
logging.info('Importing data from {}'.format(filename))
es = Elasticsearch(
    es_hosts,
    #http_auth=(es_api_user, es_api_password),
    sniff_on_start=True,  # sniff before doing anything
    sniff_on_connection_fail=True,  # refresh nodes after a node fails to respond
    sniffer_timeout=60,  # and also every 60 seconds
    retry_on_timeout=True,  # should timeout trigger a retry on different node?
)
def data_generator():
    f = open(filename)
    for line in f:
        yield {**json.loads(line), **{
            "_index": index_name,
        }}
errors_count = 0
for ok, result in streaming_bulk(es, data_generator(), chunk_size=chunk_size, refresh=refresh_index_after_insert,
                                 max_retries=max_insert_retries, yield_ok=yield_ok):
    if ok is not True:
        logging.error('Failed to import data')
        logging.error(str(result))
        errors_count += 1
        if errors_count == errors_before_interrupt:
            logging.fatal('Too many import errors, exiting with error code')
            exit(1)
print("Documents loaded to Elasticsearch")
This code runs fine when the JSON file contains a small number of documents (~100). But I just tested it with a file of 100k documents, and I got this error:
WARNING:elasticsearch:POST http://127.0.0.1:9200/_bulk?refresh=false [status:N/A request:10.010s]
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 1347, in getresponse
response.begin()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Users/me/opt/anaconda3/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
response = self.pool.urlopen(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/util/retry.py", line 386, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=10)
I have to admit this one is a bit over my head. I don't typically like to paste large error messages here, but I'm not sure what's relevant in this message.

I can't help but think that I may need to adjust some of the parameters in the es object? Or the configuration variables? I don't know enough about the parameters to make an educated decision on my own.
And last but not least, it looks like some of the documents did get loaded into the ES index regardless. But even stranger, the count shows 110k when the JSON file only has 100k.

TL;DR: Reduce the chunk_size from 10000 back to the default of 500 and I'd expect it to work. You probably want to disable the automatic retries if that can produce duplicates.
What happened?
In your streaming_bulk call you specified chunk_size=10000. This means the streaming_bulk call will try to insert the documents in chunks of 10000 elements. The connection to Elasticsearch has a configurable timeout, which is 10 seconds by default. So, if your Elasticsearch server takes more than 10 seconds to process the 10000 elements you want to insert, a timeout will happen and it will be handled as an error.
When creating the Elasticsearch object, you also specified retry_on_timeout=True, and in the streaming_bulk call you set max_retries=max_insert_retries, which is 3. This means that when such a timeout happens, the library will try to re-run the insert up to 3 times; however, when there is still a timeout after that, it gives you the error you noticed.
Also, when a timeout happens, the library can't know whether the documents were inserted successfully, so it has to assume they were not. It will therefore try to insert the same documents again. I don't know what your input lines look like, but if they don't contain an _id, this would create duplicates in your index. You probably want to prevent that, either by adding an _id of some kind, or by disabling the automatic retry and handling the errors manually.
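For illustration, here is a minimal sketch of the data_generator from the question that derives a stable _id from each input line. Hashing the raw line is just one hypothetical choice (not something from the question); any field that uniquely identifies a document would work just as well. With a stable _id, a retried insert overwrites the document instead of duplicating it:

import hashlib

def data_generator():
    with open(filename) as f:
        for line in f:
            yield {
                **json.loads(line),
                "_index": index_name,
                # a stable _id makes retries idempotent: re-sending the same
                # line updates the document rather than indexing a second copy
                "_id": hashlib.sha1(line.encode("utf-8")).hexdigest(),
            }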
What to do?
There are two ways you can approach this:

- Increase the timeout
- Reduce the chunk_size
By default, chunk_size is set to 500. Your 10000 is way higher than that. I don't expect you to gain much performance by increasing it above 500, so I'd advise you to just use the default of 500 here. If 500 still fails with timeouts, you may even want to reduce it further. This might happen if the documents you're indexing are very large.
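Applied to the configuration block from the question, that is a one-line change (a sketch mirroring the library default):

# streaming_bulk's default; smaller bulk requests are far more likely
# to finish within the 10 second read timeout
chunk_size = 500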
You can also increase the timeout for the streaming_bulk call, or alternatively for the es object. To change it only for the streaming_bulk call, you can do the following:
for ok, result in streaming_bulk(
        es,
        data_generator(),
        chunk_size=chunk_size,
        refresh=refresh_index_after_insert,
        request_timeout=60*3,  # 3 minutes
        yield_ok=yield_ok):
    # handle it like you did before
    pass
However, this also means that a failure of an elasticsearch node will only be detected after this higher timeout.
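If you would rather raise the timeout on the es object instead, a minimal sketch (timeout here is the client-wide default request timeout in elasticsearch-py 7.x; the other arguments are unchanged from the question):

es = Elasticsearch(
    es_hosts,
    sniff_on_start=True,
    sniff_on_connection_fail=True,
    sniffer_timeout=60,
    retry_on_timeout=True,
    timeout=60*3,  # raise the default 10 second read timeout for every request
)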