Apache Kafka: Spark Streaming application throws MalformedInputException when reading a Kafka topic

Tags: apache-kafka, spark-streaming

I am fairly new to Kafka and Spark Streaming. This Spark Streaming consumer reads from Kafka (the key is a string, the value is a string; a sketch of the setup follows the stack trace below) and works fine as long as the data does not contain any escaped UTF-8 strings. When it does, it fails with the following error message:

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:172)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
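For context, the consumer described above was set up roughly as follows. This is a minimal sketch, assuming the spark-streaming-kafka-0-10 direct stream API with string deserializers; the broker address, group id, and topic name are hypothetical:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("KafkaStringConsumer")
val ssc  = new StreamingContext(conf, Seconds(10))

// Key and value are both plain strings, as described above.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",               // hypothetical broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",             // hypothetical group id
  "auto.offset.reset"  -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("example-topic"), kafkaParams)  // hypothetical topic
)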
Below is an example of a line containing an escaped UTF-8 string (note that it contains two slashes):

At first I suspected the Kafka stream encoding configuration, but changing it did not help. Eventually we found that the error occurs when calling collect on the data returned by a Python UDF.
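The failing path looked essentially like the following (a minimal sketch; the script name is hypothetical). When no encoding argument is passed, RDD.pipe falls back to the JVM default charset when decoding the external process's output:

import org.apache.spark.rdd.RDD

// Pipe each partition through the external Python UDF script and collect
// the results on the driver. Without an explicit encoding, the pipe decodes
// the script's stdout with the JVM default charset, and collect() then
// fails with MalformedInputException on the escaped UTF-8 data.
val piped: RDD[String] = rdd.pipe(Seq("python", "udf.py"))  // hypothetical script
val rows = piped.collect()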


Please comment if more information is needed. Thanks in advance.

It turned out that the JVM encoding on the side collecting the data (from the Python UDF) was set to ISO-8859-1. We ultimately solved it by specifying UTF-8 in the pipe call:

rdd.pipe(command, encoding = Codec.UTF8.name)
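For completeness, Codec comes from scala.io, and the encoding parameter is only available on the Seq[String] overload of pipe, so a fuller version of the fix might look like this (the script name is again hypothetical):

import scala.io.Codec

// Decode the external process's output as UTF-8, regardless of the JVM
// default charset on the executors.
val piped = rdd.pipe(Seq("python", "udf.py"), encoding = Codec.UTF8.name)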