
Python: why am I running out of memory when using the bulk helper to add documents to Elasticsearch in bulk?

I am converting .avro files to JSON format and then parsing out specific data items to index on an Elasticsearch cluster. Each data chunk contains about 1.8 GB of data, and there are roughly 500 chunks. Memory runs out very quickly, but I thought avoiding that was exactly what the bulk helper in the Elasticsearch library is for.

Am I missing some key detail?

import datetime
import subprocess

from elasticsearch import Elasticsearch, helpers
from fastavro import reader  # assumed: fastavro's reader, since reader() is fed a file-like object


def connect_elasticsearch(address):
    es = Elasticsearch([address], verify_certs=False)
    if not es.ping():
        raise ValueError("Connection failed")
    return es


# Run a unix command line from Python and capture its output.
def run_cmd(args_list):
    print(str(datetime.datetime.now())[:19] + ': Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    s_output, s_err = proc.communicate()
    s_return = proc.returncode
    return s_return, s_output, s_err


# Stream records out of one avro file on HDFS and turn each into a bulk action.
def gendata(path):
    cat = subprocess.Popen(["hdfs", "dfs", "-cat", path], stdout=subprocess.PIPE)
    avro_reader = reader(cat.stdout)
    for record in avro_reader:
        yield {
            '_index': 'beta_homepage_survey_2',
            "hostname": record['hostname'],
            "age": record['ts'],
            "text": record['text'],
            "metadata": record['metadata'],
            "source": path}


es = connect_elasticsearch('http://myurl:9200/')

# Find all avro files for the homepage survey via the command line (hdfs commands).
(ret, out, err) = run_cmd(['hdfs', 'dfs', '-ls', '/homepage_survey/chunks/*/*.avro'])
lines = out.decode('utf-8').split('\n')  # communicate() returns bytes on Python 3
for line in lines:
    try:
        line = str('/' + line.split(" /")[1])
        print(str(datetime.datetime.now())[:19] + ": Indexing File: " + line)
        helpers.bulk(es, gendata(line))
    except Exception as ex:
        print(str(datetime.datetime.now())[:19] + ": *** Error indexing chunk:- " + type(ex).__name__)
        continue
print(str(datetime.datetime.now())[:19] + ": Indexing Complete...")
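For reference, here is a minimal sketch of an alternative, assuming the same es client and gendata() generator as above: helpers.streaming_bulk (which helpers.bulk wraps internally) lets you cap the action count and byte size of each request and consume the per-document responses lazily instead of letting them accumulate. The chunk_size and max_chunk_bytes values are illustrative, not tuned recommendations, and index_chunk is a hypothetical helper name.

def index_chunk(es, path):
    # Stream the same actions through streaming_bulk, keeping only a counter
    # in memory rather than a list of per-document results.
    ok_count = 0
    for ok, result in helpers.streaming_bulk(
            es,
            gendata(path),
            chunk_size=500,                    # actions per bulk request
            max_chunk_bytes=10 * 1024 * 1024,  # cap each request at ~10 MB
            raise_on_error=False):             # keep going past individual failures
        if ok:
            ok_count += 1
        else:
            print("Failed action: {0}".format(result))
    return ok_count

Note that helpers.bulk already chunks a generator internally (500 actions per request by default), so the generator itself is never loaded whole; lowering max_chunk_bytes mostly matters when individual documents are large.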

Have you checked where the threshold is? I mean, what is the smallest number of chunks you can insert? Also, what is the heap size on ES? And where does the error occur: on the Python side or on the Elasticsearch server side?
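(If it helps with the heap question: assuming the same es client from the question, the cat nodes API reports per-node JVM heap usage. A quick sketch, with the column names taken from the Elasticsearch _cat/nodes documentation:)

# Print per-node JVM heap usage via the _cat/nodes API.
print(es.cat.nodes(v=True, h='name,heap.percent,heap.current,heap.max'))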