Connecting to Hive from Apache Spark

Tags: apache-spark, hive, apache-spark-sql, cloudera-quickstart-vm

I have a simple program that I run on a standalone Cloudera VM. I created a managed table in Hive, and I want to read that table from Apache Spark, but the initial connection to Hive is never established. Please advise.

I run this program from IntelliJ, and I have copied hive-site.xml from /etc/hive/conf to /etc/spark/conf, yet the Spark job still does not connect to the Hive metastore.
public static void main(String[] args) throws AnalysisException {
    String master = "local[*]";

    SparkSession sparkSession = SparkSession
            .builder().appName(ConnectToHive.class.getName())
            .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
            .enableHiveSupport()
            .master(master).getOrCreate();

    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel("ERROR");

    SQLContext sqlCtx = sparkSession.sqlContext();
    HiveContext hiveContext = new HiveContext(sparkSession);
    hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");

    hiveContext.sql("SHOW DATABASES").show();
    hiveContext.sql("SHOW TABLES").show();

    sparkSession.close();
}
hive> show databases;
OK
default
sxm
temp
Time taken: 0.019 seconds, Fetched: 3 row(s)
hive> use default;
OK
Time taken: 0.015 seconds
hive> show tables;
OK
employee
Time taken: 0.014 seconds, Fetched: 1 row(s)
hive> describe formatted employee;
OK
# col_name data_type comment
id string
firstname string
lastname string
addresses array<struct<street:string,city:string,state:string>>
# Detailed Table Information
Database: default
Owner: cloudera
CreateTime: Tue Jul 25 06:33:01 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1500989581
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.07 seconds, Fetched: 29 row(s)
hive>
The output is shown below. I expected to see the "employee" table so that I could query it. Since I am running on the standalone VM, the Hive metastore is in a local MySQL server.
+------------+
|databaseName|
+------------+
| default|
+------------+
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true is the configured connection URL for the Hive metastore.
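For reference, a hive-site.xml that points both at the MySQL-backed metastore database and at the metastore Thrift service might look like the sketch below. The host name quickstart.cloudera and port 9083 (the default metastore Thrift port) are assumptions; check your own Cloudera configuration:

```xml
<configuration>
  <!-- Direct JDBC connection to the metastore database
       (used by the metastore service itself, not by Spark) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <!-- Remote metastore URI that clients such as Spark should use -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://quickstart.cloudera:9083</value>
  </property>
</configuration>
```

If hive.metastore.uris is absent from the hive-site.xml that Spark sees, Spark falls back to an embedded Derby metastore, which matches the empty table list in the output above.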
UPDATE
/usr/lib/hive/conf/hive-site.xml was not on the classpath, so Spark was not reading the tables. After adding it to the classpath, it worked fine. Because I was running from IntelliJ I had this problem; in a production environment, the Spark conf folder has a link to hive-site.xml.
17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
That log line is a hint that you are not connected to the remote Hive metastore (which you have set up on MySQL), and that the XML file is not correctly on the classpath.
You can also do this programmatically, without the XML file, before creating the SparkSession:

System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");
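Equivalently, the metastore URI can be passed through the SparkSession builder itself, which avoids mutating a global system property. This is a sketch assuming the quickstart VM host name and the default metastore port 9083:

```java
SparkSession spark = SparkSession
        .builder()
        .appName("ConnectToHive")
        .master("local[*]")
        // Point Spark at the remote Hive metastore instead of an embedded Derby one.
        .config("hive.metastore.uris", "thrift://quickstart.cloudera:9083")
        .enableHiveSupport()
        .getOrCreate();

spark.sql("SHOW DATABASES").show();
spark.sql("SHOW TABLES").show();
```

With enableHiveSupport() and a reachable metastore URI, spark.sql(...) on the session is all that is needed; no separate HiveContext is required.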
You no longer need to create a HiveContext; calling enableHiveSupport() on the SparkSession builder is sufficient.

Comments:

- Tried calling sparkSession.sql("SHOW DATABASES").show(); no luck.
- What if you remove .config("spark.sql.warehouse.dir", ...)? Spark should pick up the correct configuration on its own. If it doesn't, can you share the execution logs?
- Added the Spark logs to the question. Does "spark.sql.warehouse.dir" affect this log line? SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'.
- Try setting the HADOOP_CONF_DIR environment variable.
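Setting HADOOP_CONF_DIR is environment configuration; a sketch for the quickstart VM follows, where the exact paths are assumptions based on standard CDH layouts:

```shell
# Make the Hadoop client configuration visible to Spark
export HADOOP_CONF_DIR=/etc/hadoop/conf

# hive-site.xml must also be on the driver classpath; on CDH it lives at
# /etc/hive/conf/hive-site.xml. Either symlink it into $SPARK_HOME/conf,
# or add that directory to the IntelliJ run configuration's classpath.
```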