Apache Spark: How to write to HBase from Azure Databricks?


I am trying to build a lambda architecture with "Kafka - Spark - HBase". I am on the Azure cloud and the components run on the following platforms:
1. Kafka (0.10) - HDInsight
2. Spark (2.4.3) - Databricks
3. HBase (1.2) - HDInsight

All three components are in the same VNet, so connectivity is not an issue. I am using Spark Structured Streaming and have successfully connected to Kafka as the source.

Now, since Spark provides no native support for connecting to HBase, I am using the "Spark Hortonworks Connector" (SHC) to write the data to HBase, and I have implemented the per-batch write to HBase inside the "foreachBatch" API available from Spark 2.4 onwards.

The code is as follows:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.Dataset
import org.apache.spark.SparkConf
import org.apache.spark.sql.DataFrame

//code dependencies
import com.JEM.conf.HbaseTableConf
import com.JEM.constant.HbaseConstant

//'Spark hortonworks connector' dependency
import org.apache.spark.sql.execution.datasources.hbase._



    //---------Variables--------------//
    val kafkaBroker = "valid kafka broker"
    val topic = "valid kafka topic"
    val kafkaCheckpointLocation = "/checkpointDir"

 //---------code--------------//
    import spark.sqlContext.implicits._

    val kafkaIpStream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", kafkaBroker)
      .option("subscribe", topic)
      .option("checkpointLocation", kafkaCheckpointLocation)
      .option("startingOffsets", "earliest")
      .load()

    val streamToBeSentToHbase = kafkaIpStream.selectExpr("cast (key as String)", "cast (value as String)")
      .withColumn("ts", split($"key", "/")(1))
      .selectExpr("key as rowkey", "ts", "value as val")
      .writeStream
      .option("failOnDataLoss", false)
      .outputMode(OutputMode.Update())
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF
          .write
          .options(Map(HBaseTableCatalog.tableCatalog -> HbaseTableConf.getRawtableConf(HbaseConstant.hbaseRawTable), HBaseTableCatalog.newTable -> "5"))
          .format("org.apache.spark.sql.execution.datasources.hbase").save
      }.start()
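For reference, `HbaseTableConf.getRawtableConf` is project code not shown in this post. With SHC, the value passed to `HBaseTableCatalog.tableCatalog` is a JSON catalog string; a minimal sketch of what such a helper might return, matching the rowkey/ts/val columns produced above (the namespace, column family, and types below are assumptions, not taken from the original code):

    // Hypothetical SHC catalog: one column family "cf", string row key.
    // Column names match the DataFrame columns written in foreachBatch above.
    def getRawtableConf(tableName: String): String =
      s"""{
         |  "table": {"namespace": "default", "name": "$tableName"},
         |  "rowkey": "key",
         |  "columns": {
         |    "rowkey": {"cf": "rowkey", "col": "key", "type": "string"},
         |    "ts":     {"cf": "cf",     "col": "ts",  "type": "string"},
         |    "val":    {"cf": "cf",     "col": "val", "type": "string"}
         |  }
         |}""".stripMargin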

From the logs I can see that the code fetches the data successfully, but when it tries to write to HBase I get the following exception:

19/10/09 12:42:48 ERROR MicroBatchExecution: Query [id = 1a54283d-ab8a-4bf4-af65-63becc166328, runId = 670f90de-8ca5-41d7-91a9-e8d36dfeef66] terminated with error
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/NamespaceNotFoundException
    at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:59)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:72)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:88)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:134)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:187)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:183)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:134)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:116)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:306)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:292)
    at com.JEM.Importer.StreamingHbaseImporter$.$anonfun$main$1(StreamingHbaseImporter.scala:57)
    at com.JEM.Importer.StreamingHbaseImporter$.$anonfun$main$1$adapted(StreamingHbaseImporter.scala:53)
    at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:36)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:568)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:566)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:565)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:207)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
    at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
    at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
    at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:296)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:208)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.NamespaceNotFoundException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 44 more
19/10/09 12:42:48 INFO SparkContext: Invoking stop() from shutdown hook
19/10/09 12:42:48 INFO AbstractConnector: Stopped Spark@42f85fa4{HTTP/1.1,[http/1.1]}{172.20.170.72:47611}
19/10/09 12:42:48 INFO SparkUI: Stopped Spark web UI at http://172.20.170.72:47611
19/10/09 12:42:48 INFO StandaloneSchedulerBackend: Shutting down all executors
19/10/09 12:42:48 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
19/10/09 12:42:48 INFO SQLAppStatusListener: Execution ID: 1 Total Executor Run Time: 0
19/10/09 12:42:49 INFO SQLAppStatusListener: Execution ID: 0 Total Executor Run Time: 0
19/10/09 12:42:49 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/10/09 12:42:49 INFO MemoryStore: MemoryStore cleared
19/10/09 12:42:49 INFO BlockManager: BlockManager stopped
19/10/09 12:42:49 INFO BlockManagerMaster: BlockManagerMaster stopped
19/10/09 12:42:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/10/09 12:42:49 INFO SparkContext: Successfully stopped SparkContext
19/10/09 12:42:49 INFO ShutdownHookManager: Shutdown hook called
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/temporaryReader-79d0f9b8-c380-4141-9ac2-46c257c6c854
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/temporary-d00d2f73-96e3-4a18-9d5c-a9ff76a871bb
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-b92c0171-286b-4863-9fac-16f4ac379da8
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-ef92ca6b-2e7d-4917-b407-4426ad088cee
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/spark-9b138379-fa3a-49db-95dc-436cd7040a95
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-2666c4ff-30a2-4161-868e-137af5fa3787

What I have tried:

There are three ways to run a Spark job in Databricks:

  • Notebook: I installed the SHC jar on the cluster as a library, put "hbase-site.xml" into my job jar, installed that on the cluster as well, and pasted the main-class code into a notebook. When I run it, the SHC dependency loads successfully, but I get the above error (see the classpath check sketched after this list).

  • Jar: This is almost the same as the notebook approach, except that I assign the main class and the jar to the job and run that instead of a notebook. It gives me the same error.

  • Spark-submit: I created an uber jar containing all the dependencies (including SHC), uploaded the hbase-site.xml file to a DBFS path, and supplied it in the spark-submit parameters shown at the end of this post.

But I still get the same error. Can someone please help?
Thanks
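As a quick sanity check (a minimal sketch, not part of the original post), one can run something like the following on the Databricks driver to confirm whether the HBase client class named in the stack trace is visible on the classpath at all:

    // Hypothetical diagnostic: verify the missing class can be loaded on the driver.
    try {
      Class.forName("org.apache.hadoop.hbase.NamespaceNotFoundException")
      println("HBase client classes are visible on the driver classpath")
    } catch {
      case _: ClassNotFoundException =>
        println("HBase client classes are NOT on the driver classpath")
    }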

Comments:

I don't know what could be wrong, but I would try running the application outside Databricks and see if that is what causes it. Have you tried good ol' spark-submit?

@JacekLaskowski, yes, I tried running the application with spark-submit on Spark HDInsight with the same parameters as above and it works fine. The only difference is that on the HDInsight cluster I copied "hbase-site.xml" to every node, so I am sure Spark can access it. I am not sure whether that happens when the VMs are spawned at runtime on Databricks, but as specified in point 3 I pass the "--files" option, so Spark should be able to access the file regardless of which VM the executor runs on.

This looks more like a problem with [azure-databricks] itself, doesn't it? You should find a way to configure the notebook with the necessary JARs (possibly overriding any defaults).

I have gotten closer to the issue: the basic problem is that SHC needs the supporting HBase JARs to write data to HBase, and those JARs are missing. This link helped me get closer to the problem.

@JacekLaskowski, I tried using his connector, but that connector uses HBase 2.x dependencies while my server is 1.1.2.2.6.3.84-1, so I ran into the issue specified here.

The spark-submit parameters referred to in the third approach above:
    [
    "--class","com.JEM.Importer.StreamingHbaseImporter","dbfs:/FileStore/JarPath/jarfile.jar",
    "--packages","com.hortonworks:shc-core:1.1.1-2.1-s_2.11",
    "--repositories","http://repo.hortonworks.com/content/groups/public/",
    "--files","dbfs:/PathToSiteFile/hbase_site.xml"
    ]
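Following up on the comment above that the missing pieces appear to be the HBase client JARs themselves: a hedged sketch of how the uber jar's build could declare them alongside SHC (sbt syntax; the HBase version is matched to the 1.1.2 server mentioned in the comments, and these coordinates are assumptions rather than something from the original post):

    // build.sbt sketch (assumed coordinates; adjust to the cluster's HBase version)
    resolvers += "Hortonworks" at "http://repo.hortonworks.com/content/groups/public/"

    libraryDependencies ++= Seq(
      "com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11",
      // hbase-client provides org.apache.hadoop.hbase.NamespaceNotFoundException,
      // the class reported missing in the stack trace
      "org.apache.hbase" % "hbase-client" % "1.1.2",
      "org.apache.hbase" % "hbase-common" % "1.1.2",
      "org.apache.hbase" % "hbase-server" % "1.1.2"
    )

Installing the equivalent Maven coordinates as cluster libraries in Databricks would likely be another way to get the same classes onto both the driver and the executors.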