Scala: Spark Structured Streaming with an HBase sink

My use case is to read Kafka messages with Structured Streaming and push them into HBase from foreachBatch, using bulk Puts for better performance than single Puts. I can push the messages using foreach (thanks to an earlier answer), but I cannot do the same with the foreachBatch operation.

Can someone help? The code is attached below.

KafkaStructured.scala:


package com.test

import java.math.BigInteger
import java.util

import com.fasterxml.jackson.annotation.JsonIgnoreProperties
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql._
import org.apache.spark.sql.functions._


object KafkaStructured {

  @JsonIgnoreProperties(ignoreUnknown = true)
  case class Header(field1: String, field2: String, field3: String)

  @JsonIgnoreProperties(ignoreUnknown = true)
  case class Body(fieldx: String)

  @JsonIgnoreProperties(ignoreUnknown = true)
  case class Event(header: Header, body: Body)

  @JsonIgnoreProperties(ignoreUnknown = true)
  case class KafkaResp(event: Event)

  @JsonIgnoreProperties(ignoreUnknown = true)
  case class HBaseDF(field1: String, field2: String, field3: String)


  def main(args: Array[String]): Unit = {

    val jsonSchema = Encoders.product[KafkaResp].schema

    val spark = SparkSession
      .builder()
      .appName("Kafka Spark")
      .getOrCreate()

    val df = spark
      .readStream
      .format("kafka")
      .option...
      .load()

    import spark.sqlContext.implicits._

    val flattenedDf: DataFrame =
      df
        .select($"value".cast("string").as("json"))
        .select(from_json($"json", jsonSchema).as("data"))
        .select("data.event.header.field1", "data.event.header.field2", "data.event.header.field3")

    val hbaseDf = flattenedDf
      .as[HBaseDF]
      .filter(hbasedf => hbasedf != null && hbasedf.field1 != null)

    flattenedDf
      .writeStream
      .option("truncate", "false")
      .option("checkpointLocation", "some hdfs location")
      .format("console")
      .outputMode("append")
      .start()

    def bytes(data: String) = {
      val bytes = data match {
        case data if data != null && !data.isEmpty => Bytes.toBytes(data)
        case _ => Bytes.toBytes("")
      }
      bytes
    }

   
    hbaseDf
      .writeStream
      .foreachBatch(function = (batchDf, batchId) => {
        val putList = new util.ArrayList[Put]()
        batchDf
          .foreach(row => {
            val p: Put = new Put(bytes(row.field1))
            val cfName= bytes("fam1")
            p.addColumn(cfName, bytes("field1"), bytes(row.field1))
            p.addColumn(cfName, bytes("field2"), bytes(row.field2))
            p.addColumn(cfName, bytes("field3"), bytes(row.field3))
            putList.add(p)
          })
        new HBaseBulkForeachWriter[HBaseDF] {
          override val tableName: String = "<my table name>"
        
          override def bulkPut: util.ArrayList[Put] = {
            putList
          }
        }
      }
      )
      .start()

    spark.streams.awaitAnyTermination()
  }

}

foreachBatch allows you to use foreachPartition inside the batch function. The code executed inside foreachPartition runs only once per partition (on the executors), not once per row.

So you can create a function that builds a Put:

def putValue(key: String, columnName: String, data: Array[Byte]): Put = {
  val put = new Put(Bytes.toBytes(key))
  put.addColumn(Bytes.toBytes("colFamily"), Bytes.toBytes(columnName), data) // addColumn returns this Put, so it is the function's result
}
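
For example, building a single Put might look like this (the key, column name, and value are purely illustrative):

val put: Put = putValue("row-001", "someColumn", Bytes.toBytes("some value"))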
Then a function to bulk-insert the Puts:

def writePutList(putList: List[Put]): Unit = {
    val config: Configuration = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", zookeperUrl)

    val connection: Connection = ConnectionFactory.createConnection(config)
    val table = connection.getTable(TableName.valueOf(tableName))
    table.put(putList.asJava)
    logger.info("INSERT record[s] " + putList.size + " to table " + tableName + " OK.")
    table.close()
    connection.close()
  }
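
For reference, the two helpers above assume roughly the following imports; zookeperUrl, tableName and logger are placeholders to be supplied by your own code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._ // provides .asJava for the Put list (Scala 2.11/2.12)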
   
And use them inside foreachPartition, mapping the rows of each partition to Puts:

def writeFunction: (DataFrame, Long) => Unit = {
  (batchData, id) => {
    batchData.foreachPartition(
      partition => {
        // runs once per partition on the executors
        val putList = partition.map(
          data =>
            putValue(data.getAs[String]("keyField"), "colName", Bytes.toBytes(data.getAs[String]("valueField")))
        ).toList
        writePutList(putList)
      }
    )
  }
}
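
Applied to the question's HBaseDF schema, the same idea could look roughly like this. This is only a sketch: writeHBaseDF and safeBytes are illustrative names, it reuses the writePutList helper above together with the null guard from the question's bytes() helper, and it assumes rowkey = field1 and column family "fam1" as in the original code. Typing the partition function explicitly as Iterator[Row] => Unit keeps the call unambiguous between Dataset.foreachPartition's Scala and Java overloads.

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{DataFrame, Row}

def writeHBaseDF: (DataFrame, Long) => Unit = {
  (batchData, batchId) => {
    // Same null guard as the bytes() helper in the question
    def safeBytes(s: String): Array[Byte] =
      if (s == null || s.isEmpty) Bytes.toBytes("") else Bytes.toBytes(s)

    // Explicit function type so only the Scala overload of foreachPartition applies
    val writePartition: Iterator[Row] => Unit = { partition =>
      val putList = partition.map { row =>
        val p = new Put(safeBytes(row.getAs[String]("field1")))   // rowkey = field1
        val fam = Bytes.toBytes("fam1")
        p.addColumn(fam, Bytes.toBytes("field1"), safeBytes(row.getAs[String]("field1")))
        p.addColumn(fam, Bytes.toBytes("field2"), safeBytes(row.getAs[String]("field2")))
        p.addColumn(fam, Bytes.toBytes("field3"), safeBytes(row.getAs[String]("field3")))
        p
      }.toList
      if (putList.nonEmpty) writePutList(putList)                 // one bulk put per partition
    }
    batchData.foreachPartition(writePartition)
  }
}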
Finally, use the created function in your streaming query:

 df.writeStream
      .queryName("yourQueryName")
      .option("checkpointLocation", checkpointLocation)
      .outputMode(OutputMode.Update())
      .foreachBatch(writeFunction)
      .start()
      .awaitTermination()