
Querying an Elasticsearch index from pyspark: how do I specify es.nodes?


I am trying to query an Elasticsearch index with pyspark, without success:

$ ./bin/pyspark --driver-class-path=jars/elasticsearch-hadoop-2.2.0.jar
In ipython (Spark 2.0.1):

In [1]: es_read_conf = {
            "es.resource": "test/docs",
            "es.nodes": ["xx.xx.xx.aa", "xx.xx.xx.bb", "xx.xx.xx.cc"],
            "es.port": "9200",
            "es.net.http.auth.user": "myusername",
            "es.net.http.auth.pass": "mypassword"
        }
        es_rdd = sc.newAPIHadoopRDD(
            inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
            keyClass="org.apache.hadoop.io.NullWritable",
            valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
            conf=es_read_conf)
I get the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
There seems to be a problem converting the Python list for es.nodes into a Java string. I then tried a single string containing only the address of the Elasticsearch master node ("xx.xx.xx.aa"), but got a different error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on [test/docs] failed; server[xx.xx.xx.bb:9202] returned [502|Bad Gateway:]

Sometimes the error refers to data node bb, sometimes to cc. Interestingly, if I run the same command several times, occasionally I get no error at all (presumably when the query happens to hit only the master node). Running the command with localhost as the only es.nodes value works without problems.

See the es-hadoop configuration documentation.

You need to set the following property on the Spark conf:

conf.set("es.nodes","<your host>")
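The ClassCastException in the question comes from passing a Python list where the connector expects a plain string: every value handed to newAPIHadoopRDD's conf must map to a Java String. A minimal sketch of building the conf with a comma-separated node string instead, reusing the placeholder addresses and credentials from the question:

```python
# es-hadoop parses "host1,host2,host3" itself, but each conf value
# passed through newAPIHadoopRDD must be a single string, not a list.
nodes = ["xx.xx.xx.aa", "xx.xx.xx.bb", "xx.xx.xx.cc"]

es_read_conf = {
    "es.resource": "test/docs",
    "es.nodes": ",".join(nodes),  # "xx.xx.xx.aa,xx.xx.xx.bb,xx.xx.xx.cc"
    "es.port": "9200",
    "es.net.http.auth.user": "myusername",
    "es.net.http.auth.pass": "mypassword",
}

# Sanity check: every value is a string, as the Java side requires.
assert all(isinstance(v, str) for v in es_read_conf.values())

# With a SparkContext `sc` available, the read call itself is unchanged:
# es_rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_read_conf)
```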

Thanks for your answer, Lior. I tried your suggestion, but it raised the same error.

You should put only one node that has self-discovery enabled, and pass it as a string (not a list).

That is what I did initially, but I also tried using a list of nodes. No success.
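The intermittent 502s from bb and cc point at node discovery: by default es-hadoop asks the cluster for its data nodes and then connects to them directly, which fails when only one address is actually reachable from the client (for example, through a gateway). A sketch of disabling discovery with the documented es.nodes.wan.only setting; the host address and credentials are placeholders from the question:

```python
# Restrict es-hadoop to the declared node only. With "es.nodes.wan.only"
# set to "true" the connector stops discovering and contacting data
# nodes directly, routing every request through "es.nodes" instead.
es_read_conf = {
    "es.resource": "test/docs",
    "es.nodes": "xx.xx.xx.aa",       # the one reachable node, as a string
    "es.port": "9200",
    "es.nodes.wan.only": "true",     # booleans are passed as strings too
    "es.net.http.auth.user": "myusername",
    "es.net.http.auth.pass": "mypassword",
}
```

This trades throughput for reachability (all traffic funnels through one address), which is usually the right call when the Spark cluster sits outside the Elasticsearch network.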