
Scala Spark Structured Streaming: Avro to Avro with a custom sink


Can someone point me to a good example or sample of writing Avro to S3 or any other file system? I am using a custom sink, but I would like to pass a property map through the SinkProvider's constructor, which I assume can then be passed on to the sink.

Updated code:

import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.streaming.{OutputMode, ProcessingTime}

// Deserialize the Kafka value bytes back into Avro records, then hand the
// stringified records to the custom sink
val query = df.mapPartitions { itr =>
  itr.map { row =>
    val rowInBytes = row.getAs[Array[Byte]]("value")
    MyUtils.deserializeAvro[GenericRecord](rowInBytes).toString
  }
}.writeStream
  .format("com.test.MyStreamingSinkProvider")
  .outputMode(OutputMode.Append())
  .queryName("testQ")
  .trigger(ProcessingTime("10 seconds"))
  .option("checkpointLocation", "my_checkpoint_dir")
  .start()

query.awaitTermination()
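Side note: MyUtils.deserializeAvro above is the asker's own helper. A minimal sketch of what such a helper might look like with plain Avro follows; the schema literal is an invented example, and bytes produced by Confluent's serializer would first need their 5-byte wire-format header stripped.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

object MyUtils {
  // Invented example schema; in practice load it from a schema registry or resource file
  private val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Event","fields":[{"name":"value","type":"string"}]}""")

  def deserializeAvro[T <: GenericRecord](bytes: Array[Byte]): T = {
    val reader = new GenericDatumReader[T](schema)
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null.asInstanceOf[T], decoder)
  }
}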
Sink provider:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class MyStreamingSinkProvider extends StreamSinkProvider {

  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = {
    new MyStreamingSink
  }
}
Sink:

import org.apache.spark.sql.DataFrame
import org.slf4j.{Logger, LoggerFactory}

class MyStreamingSink extends Sink with Serializable {

  final val log: Logger = LoggerFactory.getLogger(classOf[MyStreamingSink])

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // For now, saving each batch as a plain text file
    data.rdd.saveAsTextFile("path")

    log.warn(s"Total records processed: ${data.count()}")
    log.warn("Data saved.")
  }
}
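And for the original ask (writing Avro to S3 or another file system), addBatch could delegate to the spark-avro data source instead of saveAsTextFile. A rough, untested sketch, assuming the spark-avro package is on the classpath; the s3a:// path is illustrative:

override def addBatch(batchId: Long, data: DataFrame): Unit = {
  // Re-wrap as a plain batch DataFrame before writing; a common defensive
  // pattern, since the frame handed to addBatch is backed by an incremental plan
  val batch = data.sparkSession.createDataFrame(data.rdd, data.schema)
  batch.write
    .format("com.databricks.spark.avro") // or simply "avro" on Spark 2.4+
    .mode("append")
    .save(s"s3a://my-bucket/avro-out/batch_$batchId") // illustrative path
}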

You should be able to pass parameters to your custom sink via writeStream.option(key, value):
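For example (the option values here are placeholders):

df.writeStream
  .format("com.test.MyStreamingSinkProvider")
  .option("key1", "value1")
  .option("key2", "value2")
  .start()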


In that case, the parameters map passed to MyStreamingSinkProvider.createSink(...) will contain key1 and key2.

I created the custom SinkProvider and Sink. I pass the fully qualified class name as the writeStream "format", and it works for me now. However, I have one more question. Is there a way to pass custom properties to the provider? "format" in writeStream only accepts a string argument. Please see the code below and advise whether I can pass a property map as a constructor argument from the driver class: class MyStreamingSinkProvider(prop: Map[String, String]) extends StreamSinkProvider { override def createSink(sqlContext: SQLContext, parameters: Map[String, String], partitionColumns: Seq[String], outputMode: OutputMode): Sink = new MyStreamingSink(prop) }
Thanks for the quick response. I think I have now fully edited my post from the initial error I originally posted to the current question, @cricket_007. It would be good to know how to pass custom properties to the SinkProvider. Any ideas or suggestions?
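Putting the two together: Spark creates the provider reflectively through a no-argument constructor, so a constructor parameter on the provider itself will not work. Instead, each writeStream.option(key, value) pair arrives in the parameters argument of createSink and can be forwarded to the sink from there. A sketch along those lines, where "outputPath" is an invented option name:

class MyStreamingSinkProvider extends StreamSinkProvider {

  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = {
    // parameters holds every writeStream.option(key, value) pair
    new MyStreamingSink(parameters)
  }
}

class MyStreamingSink(props: Map[String, String]) extends Sink with Serializable {

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // "outputPath" is illustrative, not part of the original post
    val outputPath = props.getOrElse("outputPath", "/tmp/out")
    data.rdd.saveAsTextFile(s"$outputPath/batch_$batchId")
  }
}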