Python: why am I running out of memory when using the bulk helper to add documents to Elasticsearch?
I convert .avro files to JSON and then parse out the specific data items to be indexed on an Elasticsearch cluster. Each chunk holds roughly 1.8 GB of data, and there are about 500 chunks. Memory runs out quickly, but I thought that is exactly what the bulk helper in the elasticsearch library is for. Am I missing some key detail?
import datetime
import subprocess

from elasticsearch import Elasticsearch, helpers
# Assuming the bare reader() call comes from fastavro, given reader(cat.stdout).
from fastavro import reader


def connect_elasticsearch(address):
    es = Elasticsearch([address], verify_certs=False)
    if not es.ping():
        raise ValueError("Connection failed")
    return es


# Function for running a unix command line from python.
def run_cmd(args_list):
    print(str(datetime.datetime.now())[:19] + ': Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err


# Generator that streams records out of one avro file on HDFS and
# yields one bulk action per record.
def gendata(path):
    cat = subprocess.Popen(["hdfs", "dfs", "-cat", path], stdout=subprocess.PIPE)
    avro_reader = reader(cat.stdout)
    for record in avro_reader:
        yield {
            '_index': 'beta_homepage_survey_2',
            "hostname": record['hostname'],
            "age": record['ts'],
            "text": record['text'],
            "metadata": record['metadata'],
            "source": path,
        }


es = connect_elasticsearch('http://myurl:9200/')

# Find all avro files for the homepage survey via the command line (hdfs commands).
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-ls', '/homepage_survey/chunks/*/*.avro'])
lines = out.decode('utf-8').split('\n')  # communicate() returns bytes on Python 3
for line in lines:
    try:
        line = str('/' + line.split(" /")[1])
        print(str(datetime.datetime.now())[:19] + ": Indexing File: " + line)
        helpers.bulk(es, gendata(line))
    except Exception as ex:
        print(str(datetime.datetime.now())[:19] + ": *** Error indexing chunk:- " + type(ex).__name__)
        continue

print(str(datetime.datetime.now())[:19] + ": Indexing Complete...")
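For reference, helpers.bulk is a wrapper around helpers.streaming_bulk that, with its defaults, also accumulates every failed item into a list it returns. Below is a minimal sketch of the per-file call using streaming_bulk directly, with explicit chunk_size and max_chunk_bytes (both are real parameters of the helper; the specific values and the index_file name are illustrative, and it reuses the gendata generator above):

def index_file(es, path):
    # streaming_bulk sends one chunk at a time and yields an (ok, result)
    # tuple per document, so nothing accumulates beyond the current chunk.
    ok_count = 0
    for ok, result in helpers.streaming_bulk(
            es,
            gendata(path),
            chunk_size=500,                     # docs per bulk request
            max_chunk_bytes=10 * 1024 * 1024,   # cap each request at ~10 MB
            raise_on_error=False):              # inspect failures instead of raising
        if ok:
            ok_count += 1
        else:
            print("Failed action: {0}".format(result))
    return ok_count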
Have you checked where the threshold is? I mean, what is the smallest number of chunks you can insert? Also, what is the heap size on ES? And where does the error occur: on the Python side or on the Elasticsearch server side?
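Regarding the heap question, a hedged sketch of reading each node's JVM heap usage through the nodes-stats API (es is the client from the question; the field names come from the standard _nodes/stats/jvm response):

# Query JVM stats for every node in the cluster.
stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    print("{0}: heap {1}% used ({2} / {3} bytes)".format(
        node.get("name", node_id),
        mem["heap_used_percent"],
        mem["heap_used_in_bytes"],
        mem["heap_max_in_bytes"]))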