Spark Structured Streaming (Java): Task not serializable

In the code below I am trying to read Avro messages from a Kafka topic. Inside the map method I call KafkaAvroDecoder's fromBytes method, which seems to cause a Task not serializable exception. How can I decode the Avro messages?

public static void main(String[] args) throws Exception {

    Properties decoderProps = new Properties();
    decoderProps.put("schema.registry.url", SCHEMA_REG_URL);
    //decoderProps.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true");

    KafkaAvroDecoder decoder = new KafkaAvroDecoder(new VerifiableProperties(decoderProps));


    SparkSession spark = SparkSession
        .builder()
        .appName("JavaCount1").master("local[2]")
        .config("spark.driver.extraJavaOptions", "-Xss4M")
        .getOrCreate();

    Dataset<Row> ds1 = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", HOSTS)
        .option("subscribe", "systemDec200Message")
        .option("startingOffsets", "earliest")
        .option("maxOffsetsPerTrigger", 1)
        .load();



    Dataset<String> ds2 = ds1.map(m -> {
        GenericData.Record data = (GenericData.Record) decoder.fromBytes((byte[]) m.get(1));

        return "sddasdadasdsadas";
    }, Encoders.STRING());

    StreamingQuery query = ds2.writeStream()
        .outputMode("append")
        .format("console")
        .trigger(ProcessingTime.apply(15))
        .start();

    query.awaitTermination();
}
I get an exception like this:


17/04/12 16:51:06 INFO CodeGenerator: Code generated in 329.145119 ms
17/04/12 16:51:07 ERROR StreamExecution: Query [id = 1d56386c-3fba-4978-8565-6b9c880d4fce, runId = B7BB8D8-b52d-4c14-9dec-bc9cb41f8d77] terminated with error
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:839)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:371)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
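
The decoder and the VerifiableProperties it wraps are captured by the map closure, and neither class is Serializable. A minimal sketch of constructing the decoder inside the lambda instead, reusing the same classes and the SCHEMA_REG_URL constant from the code above, so that only a serializable String is captured (the explicit MapFunction cast is added here to disambiguate the Java overloads):

    Dataset<String> ds2 = ds1.map((MapFunction<Row, String>) m -> {
        // Built inside the closure: only the registry URL String is
        // captured, so Spark can serialize the lambda itself.
        Properties props = new Properties();
        props.put("schema.registry.url", SCHEMA_REG_URL);
        KafkaAvroDecoder localDecoder =
            new KafkaAvroDecoder(new VerifiableProperties(props));

        // Column 1 of the Kafka source schema is the message value (binary).
        GenericData.Record data =
            (GenericData.Record) localDecoder.fromBytes((byte[]) m.get(1));
        return data.toString();
    }, Encoders.STRING());

Creating a decoder per record is wasteful; caching it in a static field, initialized lazily on each executor, would amortize the cost while still keeping it out of the closure.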

After moving the KafkaAvroDecoder declaration into the lambda's scope (inside the map call), as sketched above, the serialization problem went away. However, I now get a different exception at runtime:

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 116, Column 101: No applicable constructor/method found for actual parameters "long"; candidates are: "java.lang.Integer(int)", "java.lang.Integer(java.lang.String)"
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174)
    at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7559)
    at org.codehaus.janino.UnitCompiler.invokeConstructor(UnitCompiler.java:6505)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4126)
    at org.codehaus.janino.UnitCompiler.access$7600(UnitCompiler.java:185)
    at org.codehaus.janino.UnitCompiler$10.visitNewClassInstance(UnitCompiler.java:3275)
    at org.codehaus.janino.Java$NewClassInstance.accept(Java.java:4085)
    at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
    at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
    at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3571)

For any Task not serializable error, part of the stack trace tells you exactly what Spark is trying to serialize. For this codegen error, line 116 of the generated file is:

    /* 116 */ final java.lang.Integer deserializetoobject_value10 = deserializetoobject_resultIsNull3 ? null : new java.lang.Integer(deserializetoobject_argValue3);
    /* 117 */ deserializetoobject_javaBean.setOffset(deserializetoobject_value10);
    /* 118 */
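
The Kafka source's offset column is a long, but the generated code is calling setOffset through java.lang.Integer, and Integer has no constructor taking a long, hence the Janino error. Assuming a Java bean is used with Encoders.bean somewhere (the bean itself is not shown in the question), declaring the offset property as Long should line up with the source schema. A hypothetical sketch:

    // Hypothetical bean (name and other fields assumed): the Kafka source's
    // "offset" column is a long, so the property must be Long, not Integer.
    public static class KafkaRecordBean implements java.io.Serializable {
        private Long offset;  // was Integer, which has no (long) constructor

        public Long getOffset() { return offset; }
        public void setOffset(Long offset) { this.offset = offset; }
    }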