Connecting to Hive from Apache Spark

Tags: apache-spark, hive, apache-spark-sql, cloudera-quickstart-vm

I have a simple program that I am running on the standalone Cloudera VM. I created a managed table in Hive, and I want to read that table from Apache Spark, but the initial connection to Hive is not being established. Please advise.

I am running this program from IntelliJ. I copied hive-site.xml from /etc/hive/conf to /etc/spark/conf, but even so the Spark job does not connect to the Hive metastore.

 import org.apache.spark.SparkContext;
 import org.apache.spark.sql.AnalysisException;
 import org.apache.spark.sql.SQLContext;
 import org.apache.spark.sql.SparkSession;
 import org.apache.spark.sql.hive.HiveContext;

 public class ConnectToHive {

     public static void main(String[] args) throws AnalysisException {
         String master = "local[*]";

         SparkSession sparkSession = SparkSession
                 .builder().appName(ConnectToHive.class.getName())
                 .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
                 .enableHiveSupport()
                 .master(master).getOrCreate();

         SparkContext context = sparkSession.sparkContext();
         context.setLogLevel("ERROR");

         SQLContext sqlCtx = sparkSession.sqlContext();

         HiveContext hiveContext = new HiveContext(sparkSession);
         hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");

         hiveContext.sql("SHOW DATABASES").show();
         hiveContext.sql("SHOW TABLES").show();

         sparkSession.close();
     }
 }
 hive> show databases;
 OK
 default
 sxm
 temp
 Time taken: 0.019 seconds, Fetched: 3 row(s)
 hive> use default;
 OK
 Time taken: 0.015 seconds
 hive> show tables;
 OK
 employee
 Time taken: 0.014 seconds, Fetched: 1 row(s)
 hive> describe formatted employee;
 OK
 # col_name             data_type               comment             

 id                     string                                      
 firstname              string                                      
 lastname               string                                      
 addresses              array<struct<street:string,city:string,state:string>>                       

 # Detailed Table Information        
 Database:              default                  
 Owner:                 cloudera                 
 CreateTime:            Tue Jul 25 06:33:01 PDT 2017     
 LastAccessTime:        UNKNOWN                  
 Protect Mode:          None                     
 Retention:             0                        
 Location:              hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee     
 Table Type:            MANAGED_TABLE            
 Table Parameters:       
    transient_lastDdlTime   1500989581          

 # Storage Information       
 SerDe Library:         org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
 InputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
 OutputFormat:          org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
 Compressed:            No                       
 Num Buckets:           -1                       
 Bucket Columns:        []                       
 Sort Columns:          []                       
 Storage Desc Params:        
    serialization.format    1                   
 Time taken: 0.07 seconds, Fetched: 29 row(s)
 hive> 
The output is shown below. I was hoping to see the employee table so that I could query it. Since I am running on the standalone VM, the Hive metastore is in the local MySQL server.

 +------------+
 |databaseName|
 +------------+
 |     default|
 +------------+

 +--------+---------+-----------+
 |database|tableName|isTemporary|
 +--------+---------+-----------+
 +--------+---------+-----------+
jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true is the configuration for the Hive metastore.
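For reference, that JDBC URL lives in hive-site.xml, which is also where the metastore Thrift URI that Spark needs is declared. A minimal sketch of the relevant properties (the JDBC URL is the one from this post; the Thrift host/port and driver class are assumptions based on a typical quickstart VM setup):

```
<!-- hive-site.xml (sketch): minimal properties for a MySQL-backed metastore. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <!-- driver class assumed; matches the MySQL Connector/J shipped with CDH -->
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <!-- host/port assumed; this is what Spark reads to reach the metastore -->
    <name>hive.metastore.uris</name>
    <value>thrift://quickstart.cloudera:9083</value>
  </property>
</configuration>
```

If Spark cannot see this file on its classpath, it silently falls back to an embedded Derby metastore, which is exactly the symptom described below.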

UPDATE

/usr/lib/hive/conf/hive-site.xml was not on the classpath, so Spark was not reading the table; after adding it to the classpath it worked fine. Because I was running from IntelliJ I had this problem; in a production environment the Spark conf folder will have a link to hive-site.xml.
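Two ways to get the file onto the classpath, sketched with the paths mentioned in this post (the resources-directory name is an assumption about the IntelliJ project layout):

```shell
# Production-style: link hive-site.xml into Spark's conf directory,
# so spark-submit picks it up automatically.
sudo ln -s /usr/lib/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml

# IDE-style: copy it into a directory IntelliJ places on the runtime
# classpath, e.g. the Maven resources directory (name assumed).
cp /usr/lib/hive/conf/hive-site.xml src/main/resources/
```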

 17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
This is the hint that you are not connected to the remote Hive metastore (which you have set up on MySQL), and that the XML file is not correctly placed on the classpath.

You can also do it programmatically, without the XML file, before building the SparkSession:

System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");

Comments:

- You no longer need to create a HiveContext. Calling enableHiveSupport() on the SparkSession is enough.
- Tried calling sparkSession.sql("SHOW DATABASES").show() — no luck.
- What if you remove .config("spark.sql.warehouse.dir", ...)? Spark should pick the correct configuration on its own. If not, can you share the execution logs? — Added the Spark logs to the question.
- Does "spark.sql.warehouse.dir" affect the log line SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'?
- Thanks... In SPARK_HOME I have a link to hive-site.xml, where the metastore and all the other details are specified: /usr/lib/hive/conf/hive-site.xml
- Link or copy, sure, but that is not strictly necessary. Another option is to define the HADOOP_CONF_DIR environment variable.
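Putting the answer and comments together: since Spark 2.x the HiveContext is unnecessary, because enableHiveSupport() makes plain spark.sql(...) Hive-aware. A minimal sketch of the same program without it (the Thrift host METASTORE is a placeholder, exactly as in the answer above — it is not verified against any real cluster):

```java
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.SparkSession;

public class ConnectToHiveNoContext {

    public static void main(String[] args) throws AnalysisException {
        SparkSession spark = SparkSession
                .builder()
                .appName(ConnectToHiveNoContext.class.getName())
                .master("local[*]")
                // Points Spark at the remote metastore; replace METASTORE with the
                // real host, or drop this line if hive-site.xml is on the classpath.
                .config("hive.metastore.uris", "thrift://METASTORE:9083")
                .enableHiveSupport()
                .getOrCreate();

        spark.sparkContext().setLogLevel("ERROR");

        // With enableHiveSupport(), no HiveContext is needed.
        spark.sql("SHOW DATABASES").show();
        spark.sql("SHOW TABLES").show();

        spark.close();
    }
}
```

If the connection is correct, SHOW TABLES should list the employee table from the default database instead of an empty result.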