Apache Spark: how can I read Druid data using a JDBC driver with Spark?


How can I read data from Druid with Spark and the Avatica JDBC driver?

Reading data from Druid with Python and the jaydebeapi module works; the following code succeeds:

$ python
import jaydebeapi

conn = jaydebeapi.connect("org.apache.calcite.avatica.remote.Driver",
                          "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/",
                          {"user": "druid", "password":"druid"},
                          "/root/avatica-1.17.0.jar",
       )
cur = conn.cursor()
cur.execute("SELECT * FROM INFORMATION_SCHEMA.TABLES")
cur.fetchall()
The output is:

[('druid', 'druid', 'wikipedia', 'TABLE'),
('druid', 'INFORMATION_SCHEMA', 'COLUMNS', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'SCHEMATA', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'TABLES', 'SYSTEM_TABLE'),
('druid', 'sys', 'segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'server_segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'servers', 'SYSTEM_TABLE'),
('druid', 'sys', 'supervisors', 'SYSTEM_TABLE'),
('druid', 'sys', 'tasks', 'SYSTEM_TABLE')]  -> default tables

But I want to read it with Spark over JDBC.

I tried, but there is a problem when running the Spark code below:

$ pyspark --jars /root/avatica-1.17.0.jar

df = spark.read.format('jdbc') \
    .option('url', 'jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/') \
    .option("dbtable", 'INFORMATION_SCHEMA.TABLES') \
    .option('user', 'druid') \
    .option('password', 'druid') \
    .option('driver', 'org.apache.calcite.avatica.remote.Driver') \
    .load()
It fails with:
Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
  File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2999.load.
: java.sql.SQLException: While closing connection
...
Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
 at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
 at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46] 
...
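The `UnrecognizedPropertyException` on `rpcMetadata` means the Avatica client classes that Spark ends up using cannot parse the JSON responses coming back from Druid's Avatica server, i.e. a client/server wire-format mismatch. As a hedged sketch (the `avatica-protobuf` endpoint path and the `serialization=PROTOBUF` property are assumptions based on Druid exposing a protobuf variant of its Avatica endpoint, not something verified in this setup), one thing worth trying is switching the connection to protobuf serialization, which bypasses Avatica's JSON parsing entirely:

```python
def avatica_url(host, port, protobuf=False):
    """Build the Avatica JDBC URL for Druid's SQL endpoint.

    protobuf=True targets the avatica-protobuf endpoint; that path and
    the serialization property are assumptions, not verified here.
    """
    base = f"jdbc:avatica:remote:url=http://{host}:{port}/druid/v2/sql/"
    if protobuf:
        # Protobuf framing avoids Avatica's JSON wire format, which is
        # where the "rpcMetadata" parse error above is raised.
        return base + "avatica-protobuf/;serialization=PROTOBUF"
    return base + "avatica/"
```

The resulting URL would be passed to the same `spark.read.format('jdbc')` options as above. It can also help to rule out classpath mismatches by passing the jar with `--driver-class-path /root/avatica-1.17.0.jar` in addition to `--jars`, so the driver JVM is sure to load the same avatica-1.17.0.jar.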
Notes:

  • I downloaded the Avatica JAR file (avatica-1.17.0.jar).
  • I installed the Druid server with the default settings.

    • I found another way to solve this problem. I used a spark-druid connector to connect Druid with Spark,

      but I changed some of its code so that it would run in my environment.

      This is my environment:

      • Spark: 2.4.4
      • Scala: 2.11.12
      • Python: Python 3.6.8
      • Druid:
        • Zookeeper: 3.5
        • Druid: 0.17.0
      However, it has one problem:

      • If you use the spark-druid connector at least once, every SQL query you run afterwards (such as
        spark.sql("SELECT * FROM temp_view")
        ) is routed through that connector's planner.
      • However, if you use the DataFrame API instead, such as
        df.distinct().count()
        , then there is no problem. I have not solved this yet.
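Since the jaydebeapi path works while Spark's JDBC reader does not, another workaround is to fetch the rows with jaydebeapi and hand them to Spark with `createDataFrame`, bypassing the Avatica/Spark JDBC path entirely. This is a sketch, not a verified implementation: the helper names are mine, the URL, credentials, and jar path are taken from the code above, and the commented usage requires jaydebeapi, a running Druid broker, and an active SparkSession:

```python
AVATICA_URL = "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/"

def column_names(description):
    # DB-API 2.0: each entry of cursor.description starts with the column name.
    return [col[0] for col in description]

def fetch_druid(sql, jar="/root/avatica-1.17.0.jar"):
    """Run a query over the jaydebeapi connection that already works
    and return (column_names, rows)."""
    import jaydebeapi  # third-party; imported lazily
    conn = jaydebeapi.connect(
        "org.apache.calcite.avatica.remote.Driver",
        AVATICA_URL,
        {"user": "druid", "password": "druid"},
        jar,
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return column_names(cur.description), cur.fetchall()
    finally:
        conn.close()

# With a running Druid broker and an active SparkSession:
# cols, rows = fetch_druid("SELECT * FROM INFORMATION_SCHEMA.TABLES")
# df = spark.createDataFrame(rows, schema=cols)  # bypasses Spark's JDBC reader
```

All rows pass through the driver process, so this only suits small-to-medium result sets, but it sidesteps the broken JDBC reader for cases like the INFORMATION_SCHEMA query above.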