
Apache Spark: unable to deserialize Avro messages with Spark Structured Streaming, where the key is string-serialized and the value is Avro


Using Spark 2.4.0

Confluent Schema Registry to fetch the schema

The message key is serialized as a String and the value as Avro, so I am trying to deserialize the value using io.confluent.kafka.serializers.KafkaAvroDeserializer, but it is not working. Can someone check my code and see what is wrong?
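Reading the key itself is straightforward because it is a plain string; it can be recovered with a cast on the streaming DataFrame defined below, e.g.:

    val keys = df.selectExpr("CAST(key AS STRING)")

It is only the Avro-encoded value that needs the Confluent deserializer.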

Imported libraries:

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.Deserializer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{ Encoder, SparkSession}
Code body:

    val topics = "test_topic"
    val spark: SparkSession = SparkSession.builder
      .config("spark.streaming.stopGracefullyOnShutdown", "true")
      .config("spark.streaming.backpressure.enabled", "true")
      .config("spark.streaming.kafka.maxRatePerPartition", 2170)
      .config("spark.streaming.kafka.maxRetries", 1)
      .config("spark.streaming.kafka.consumer.poll.ms", "600000")
      .appName("SparkStructuredStreamAvro")
      .config("spark.sql.streaming.checkpointLocation", "/tmp/new_checkpoint/")
      .enableHiveSupport()
      .getOrCreate


    //add settings for schema registry url, used to get deser
    val schemaRegUrl = "http://xx.xx.xx.xxx:xxxx"
    val client = new CachedSchemaRegistryClient(schemaRegUrl, 100)

    //subscribe to kafka
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "xx.xx.xxxx")
      .option("subscribe", "test.topic")
      .option("kafka.startingOffsets", "latest")
      .option("group.id", "use_a_separate_group_id_for_each_stream")
      .load()

    //add confluent kafka avro deserializer, needed to read messages appropriately
    val deser = new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]]

    //needed to convert column select into Array[Bytes]
    import spark.implicits._

    val results = df.select(col("value").as[Array[Byte]]).map { rawBytes: Array[Byte] =>
      //read the raw bytes from spark and then use the confluent deserializer to get the record back

      val decoded = deser.deserialize(topics, rawBytes)
      val recordId = decoded.get("nameId").asInstanceOf[org.apache.avro.util.Utf8].toString
      recordId
    }


    results.writeStream
      .outputMode("append")
      .format("text")
      .option("path", "/tmp/path_new/")
      .option("truncate", "false")
      .start()
      .awaitTermination()
    spark.stop()

It fails to deserialize, and I receive this error:

Caused by: java.io.NotSerializableException: io.confluent.kafka.serializers.KafkaAvroDeserializer
Serialization stack:
        - object not serializable (class: io.confluent.kafka.serializers.KafkaAvroDeserializer, value: io.confluent.kafka.serializers.KafkaAvroDeserializer@591024db)
        - field (class: ca.bell.wireless.ingest$$anonfun$1, name: deser$1, type: interface org.apache.kafka.common.serialization.Deserializer)
        - object (class ca.bell.wireless.ingest$$anonfun$1, <function1>)
        - element of array (index: 1)

You defined the variable ('deser') for the KafkaAvroDeserializer outside the map block. Spark has to serialize the map closure to ship it to the executors, and that pulls in the deser field; KafkaAvroDeserializer does not implement Serializable, which is what throws the exception.

Try changing your code as follows:

// broadcast the deserializer once from the driver instead of capturing it in the closure
val brdDeser = spark.sparkContext.broadcast(new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]])

val results = df.select(col("value").as[Array[Byte]]).map { rawBytes: Array[Byte] =>
  // look up the executor-local copy of the deserializer
  val deser = brdDeser.value
  val decoded = deser.deserialize(topics, rawBytes)
  val recordId = decoded.get("nameId").asInstanceOf[org.apache.avro.util.Utf8].toString
  recordId
}