Python 3.x: How do I connect Spark to Hive using pyspark?
Tags: python-3.x, hive, pyspark, pyspark-sql, thrift-protocol

I am trying to read Hive tables remotely using pyspark. It fails with an error stating that it is unable to connect to the Hive Metastore client.

I have read multiple answers on SO and other sources. They mostly deal with configuration, but none of them explains why I cannot connect remotely. From what I have read, it should be possible to connect Spark to Hive without changing any configuration files. Note: I have port-forwarded a machine on which Hive is running and made it available at localhost:10000. I even connected to the same instance using presto and was able to run queries on Hive.

The code is:
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Attempt to point Spark at the metastore through a system property
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
# This call raises the AnalysisException shown below
df.write.saveAsTable('example')
I expected the output to be a confirmation that the table had been saved, but instead I get an error.
The abridged error is:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 775, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
When I check ports 10000 and 9083 from the command line:
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 10000
Connection to localhost 10000 port [tcp/webmin] succeeded!
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 9083
Connection to localhost 9083 port [tcp/*] succeeded!
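The same reachability check can be sketched in Python, which is handy when `nc` is not available on the machine that runs the Spark driver. This is only a minimal sketch: `port_is_open` is a hypothetical helper name, and it only shows that a TCP connection is accepted, not that the service behind the port is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (including timeouts and refused connections) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the setup in the question, this would be called as `port_is_open("localhost", 10000)` and `port_is_open("localhost", 9083)`.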
When I run the script, I get the following error:
Caused by: java.net.UnknownHostException: ip-172-16-1-101.ap-south-1.compute.internal
... 45 more
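A `java.net.UnknownHostException` generally means that the named host could not be resolved from the machine making the call, which is a separate question from whether the forwarded ports accept connections. A minimal sketch for checking name resolution from the client machine follows. `can_resolve` is a hypothetical helper, and whether this is the root cause in this particular setup is an assumption.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can map hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

Calling `can_resolve("ip-172-16-1-101.ap-south-1.compute.internal")` on the client would show whether the internal host name from the trace is resolvable there.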
The key is to supply the Hive metastore configuration while creating the Spark session itself:
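sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                # Pass the metastore URI as a plain key/value pair. When conf= is
                # also given, the builder takes its options from that SparkConf
                # instead of the key/value pair, so it is omitted here.
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )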
Note that no changes to the Spark configuration files are required, and even serverless services such as AWS Glue can connect this way.
Full code:
from pyspark.sql import SparkSession

# Java equivalent, kept for reference:
#   SparkSession ss = SparkSession
#       .builder()
#       .appName(" Hive example")
#       .config("hive.metastore.uris", "thrift://localhost:9083")
#       .enableHiveSupport()
#       .getOrCreate();

sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .config("hive.metastore.uris", "thrift://localhost:9083")
                .enableHiveSupport()
                .getOrCreate()
                )

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write into Hive
#df.write.saveAsTable('example')

# Read the table back from Hive; show() prints the rows itself and returns None
df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()
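The session setup above can be wrapped in a small helper so that the metastore host and port are not hard-coded. This is a sketch under assumptions: `metastore_uri` and `build_hive_session` are hypothetical names, and the default port 9083 is simply the one used in this thread.

```python
def metastore_uri(host: str, port: int = 9083) -> str:
    """Build a Thrift URI for the Hive metastore (9083 is the port used above)."""
    return f"thrift://{host}:{port}"


def build_hive_session(uri: str, app_name: str = "hive-example"):
    """Create (or reuse) a SparkSession configured for the metastore at uri."""
    # Imported lazily so the URI helper can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("hive.metastore.uris", uri)
        .enableHiveSupport()
        .getOrCreate()
    )
```

Calling `spark = build_hive_session(metastore_uri("localhost"))` followed by `spark.sql("SHOW DATABASES").show()` is one way to check that the metastore is actually being reached before writing any tables.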
Comments:
You may get some ideas from this.
No, it doesn't help. It reports "unable to connect to the metastore server".