Apache Spark: streaming Avro files from a directory

Tags: apache-spark, avro, spark-structured-streaming, spark-avro

I'm trying to set up Structured Streaming over a directory of Avro files. We already have some non-streaming code that handles exactly the same data, so the lowest-effort step toward streaming would be to reuse that code.

To move to Structured Streaming, I tried the following, which works in non-streaming mode (using read instead of readStream) but gives me a serialization error when run as a stream.
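
For reference, the batch version that works looks roughly like this. It is a minimal sketch rather than our exact code, and structType and the path are the same placeholders used in the streaming snippet below:

import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
import com.databricks.spark.avro._

// Plain batch read of the same directory, with the same converted schema.
val batchDf = sqlContext.read
  .schema(structType.get)
  .format("com.databricks.spark.avro")
  .load("dbfs://path/to/avro/files/*")

// Any action runs fine here, e.g. counting the rows.
batchDf.count()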

import com.databricks.spark.avro._
import org.apache.avro._
import org.apache.spark.sql.types._

val schemaStr = "{our_schema_here}"
val parser = new Schema.Parser()
val avroSchema = parser.parse(schemaStr)
val structType = SchemaConverters.toSqlType(avroSchema).dataType match {
  case t: StructType => Some(t)
  case _ => throw new RuntimeException(
    s"""Avro schema cannot be converted to a Spark SQL StructType:
       |
       |${avroSchema.toString(true)}""".stripMargin)
}

val path = "dbfs://path/to/avro/files/*"

val avroStream = sqlContext
  .readStream
  .schema(structType.get)
  .format("com.databricks.spark.avro")
  .option("maxFilesPerTrigger", 5)
  .load(path)
  .writeStream
  .outputMode("append")
  .format("memory")
  .queryName("counts")
  .start()
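
One debugging note (a suggestion, not a confirmed fix): the stack trace below shows the failure inside MemorySink.addBatch, which calls Dataset.collect on the driver, so swapping the memory sink for the console sink is a quick way to probe whether the non-serializable object comes from the sink path or from the query itself:

// Same query with the console sink instead of memory (debugging sketch only).
val debugQuery = sqlContext
  .readStream
  .schema(structType.get)
  .format("com.databricks.spark.avro")
  .option("maxFilesPerTrigger", 5)
  .load(path)
  .writeStream
  .outputMode("append")
  .format("console")
  .start()

If the console sink fails the same way, the offending object is presumably captured by the source or schema handling rather than by the memory sink's collect.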
The exception I get is shown below. Note that I can't get the full stack trace, since I'm on Databricks and don't have access to the executor logs. I'm somewhat at a loss as to what object exactly is failing to serialize.

org.apache.spark.SparkException: Task not serializable

at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2125)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:937)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:936)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:291)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2966)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2456)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2456)
at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2950)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2949)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2456)
at org.apache.spark.sql.execution.streaming.MemorySink.addBatch(memory.scala:217)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(StreamExecution.scala:731)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply(StreamExecution.scala:731)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply(StreamExecution.scala:731)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:62)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:729)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:328)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:62)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:312)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:226)
Caused by: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
Serialization stack:

at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 41 more
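
For what it's worth, as a general observation rather than a confirmed diagnosis of this particular query: scala.collection.immutable.MapLike$$anon$1 is the name Scala 2.11/2.12 gives the anonymous view classes returned by Map#mapValues and Map#filterKeys, and those lazy views do not implement Serializable (which of the two gets the $$anon$1 suffix depends on the Scala version). A minimal standalone reproduction:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

val m = Map("a" -> 1, "b" -> 2)
val view = m.mapValues(_ + 1)                 // lazy view, class MapLike$$anon$N
val fixed = m.mapValues(_ + 1).map(identity)  // .map(identity) materializes a real Map

def trySerialize(o: AnyRef): Unit = {
  val oos = new ObjectOutputStream(new ByteArrayOutputStream())
  oos.writeObject(o)
}

trySerialize(fixed)  // fine
trySerialize(view)   // throws java.io.NotSerializableException

If something in the plan (for example inside the Avro-to-SQL schema conversion) holds on to such a view, the usual workaround is forcing materialization with .map(identity) before it gets captured in a task.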

Comments:
Did you find a solution?
No, I switched contexts, so I never dealt with it. I still think it's an interesting question :)