Elasticsearch bulk insert w/ Python - socket timeout error
ElasticSearch 7.10.2, Python 3.8.5, elasticsearch-py 7.12.1

I'm trying to bulk insert 100,000 records into ElasticSearch using the elasticsearch-py bulk helper. Here is the Python code:
import sys
import datetime
import json
import os
import logging
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk
# ES Configuration start
es_hosts = [
    "http://localhost:9200",
]
es_api_user = 'user'
es_api_password = 'pw'
index_name = 'index1'
chunk_size = 10000
errors_before_interrupt = 5
refresh_index_after_insert = False
max_insert_retries = 3
yield_ok = False # if set to False will skip successful documents in the output
# ES Configuration end
# =======================
filename = 'file.json'
logging.info('Importing data from {}'.format(filename))
es = Elasticsearch(
    es_hosts,
    #http_auth=(es_api_user, es_api_password),
    sniff_on_start=True,  # sniff before doing anything
    sniff_on_connection_fail=True,  # refresh nodes after a node fails to respond
    sniffer_timeout=60,  # and also every 60 seconds
    retry_on_timeout=True,  # should timeout trigger a retry on different node?
)
def data_generator():
    f = open(filename)
    for line in f:
        yield {**json.loads(line), **{
            "_index": index_name,
        }}
errors_count = 0
for ok, result in streaming_bulk(es, data_generator(), chunk_size=chunk_size, refresh=refresh_index_after_insert,
                                 max_retries=max_insert_retries, yield_ok=yield_ok):
    if ok is not True:
        logging.error('Failed to import data')
        logging.error(str(result))
        errors_count += 1
        if errors_count == errors_before_interrupt:
            logging.fatal('Too many import errors, exiting with error code')
            exit(1)
print("Documents loaded to Elasticsearch")
This code runs fine when the JSON file contains a small number of documents (~100). But I just tested it with a file of 100k documents, and I got this error:
WARNING:elasticsearch:POST http://127.0.0.1:9200/_bulk?refresh=false [status:N/A request:10.010s]
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 1347, in getresponse
response.begin()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Users/me/opt/anaconda3/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
response = self.pool.urlopen(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/util/retry.py", line 386, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=10)
I have to admit this one is a bit over my head. I don't typically like to paste large error messages here, but I'm not sure what's relevant in this message.

I can't help but think that I may need to adjust some of the parameters in the es object? Or the configuration variables? I don't know enough about the parameters to make an educated decision on my own.
And last but not least, it looks like some of the documents did get loaded into the ES index regardless. But even stranger, the count shows 110k when the JSON file only has 100k.

TL;DR: Reduce the chunk_size from 10000 back to the default of 500 and I'd expect it to work. You probably want to disable the automatic retries if that can produce duplicates.
What happened?
In your streaming_bulk call you specified chunk_size=10000. This means the streaming_bulk call will try to insert the documents in chunks of 10000 elements. The connection to Elasticsearch has a configurable timeout, which is 10 seconds by default. So, if your Elasticsearch server takes more than 10 seconds to process the 10000 elements you want to insert, a timeout will happen and it will be handled as an error.
When creating the Elasticsearch object, you also specified retry_on_timeout=True, and in the streaming_bulk call you set max_retries=max_insert_retries, which is 3. This means that when such a timeout happens, the library will try to re-run the insert up to 3 times; however, when there is still a timeout after that, it gives you the error you noticed.
Also, when a timeout happens, the library can't know whether the documents were inserted successfully, so it has to assume they were not. It will therefore try to insert the same documents again. I don't know what your input lines look like, but if they don't contain an _id, this would create duplicates in your index. You probably want to prevent that, either by adding an _id of some kind, or by disabling the automatic retry and handling the errors manually.
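For illustration, here is a minimal sketch of the data_generator from the question that derives a stable _id from each input line. Hashing the raw line is just one hypothetical choice (not something from the question); any field that uniquely identifies a document would work just as well. With a stable _id, a retried insert overwrites the document instead of duplicating it:

import hashlib

def data_generator():
    with open(filename) as f:
        for line in f:
            yield {
                **json.loads(line),
                "_index": index_name,
                # a stable _id makes retries idempotent: re-sending the same
                # line updates the document rather than indexing a second copy
                "_id": hashlib.sha1(line.encode("utf-8")).hexdigest(),
            }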
What to do?
There are two ways you can approach this:

- Increase the timeout
- Reduce the chunk_size
By default, chunk_size is set to 500. Your 10000 is way higher than that. I don't expect you to gain much performance by increasing it above 500, so I'd advise you to just use the default of 500 here. If 500 still fails with timeouts, you may even want to reduce it further. This might happen if the documents you're indexing are very large.
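Applied to the configuration block from the question, that is a one-line change (a sketch mirroring the library default):

# streaming_bulk's default; smaller bulk requests are far more likely
# to finish within the 10 second read timeout
chunk_size = 500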
You can also increase the timeout for the streaming_bulk call, or alternatively for the es object. To change it only for the streaming_bulk call, you can do the following:
for ok, result in streaming_bulk(
        es,
        data_generator(),
        chunk_size=chunk_size,
        refresh=refresh_index_after_insert,
        request_timeout=60*3,  # 3 minutes
        yield_ok=yield_ok):
    # handle it like you did before
    pass
However, this also means that a failure of an elasticsearch node will only be detected after this higher timeout.
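If you would rather raise the timeout on the es object instead, a minimal sketch (timeout here is the client-wide default request timeout in elasticsearch-py 7.x; the other arguments are unchanged from the question):

es = Elasticsearch(
    es_hosts,
    sniff_on_start=True,
    sniff_on_connection_fail=True,
    sniffer_timeout=60,
    retry_on_timeout=True,
    timeout=60*3,  # raise the default 10 second read timeout for every request
)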