How to read Avro binary (Base64) encoded data in Spark Scala
I am trying to read an Avro file that is binary (Base64) encoded and snappy compressed. A Hadoop cat on the Avro file looks like this:
Objavro.schema?
{"type":"record","name":"ConnectDefault","namespace":"xyz.connect.avro","fields":
[{"name":"service","type":"string"},{"name":"timestamp","type":"long"},
{"name":"count","type":"int"},{"name":"encoderKey","type":{"type":"map","values":"string"}},
{"name":"schema","type":"string"},{"name":"data","type":"string"}]}>??n]
I need to extract and read the "schema" and "data" fields from the file above.
The "schema" is associated with the "data" across multiple files.
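Since the "data" field arrives Base64 encoded, decoding it back to raw bytes can be done with the JDK's built-in codec. A minimal sketch (the object name and sample payload below are hypothetical, just to make it self-contained):

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object Base64DecodeSketch {
  // Decode a Base64-encoded "data" field back to its raw bytes.
  def decodeData(encoded: String): Array[Byte] =
    Base64.getDecoder.decode(encoded)

  def main(args: Array[String]): Unit = {
    // Hypothetical payload standing in for a record's "data" field.
    val encoded =
      Base64.getEncoder.encodeToString("hello avro".getBytes(StandardCharsets.UTF_8))
    val decoded = decodeData(encoded)
    println(new String(decoded, StandardCharsets.UTF_8)) // prints "hello avro"
  }
}
```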
I tried the following steps:
1. Read the binary file:
val binaryFilesRDD = sc.binaryFiles("file+0+00724+00731.avro").map { x => ( x._2.toArray) }
binaryFilesRDD: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[1] at map at
<console>:24
scala> val newDataRecords = getGenericRecordfromByte(newArray,inputDataSchema)
org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -40
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:363)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:355)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:157)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
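The `Malformed data. Length is negative` error typically means the bytes were fed straight to a `BinaryDecoder`, but the file is an Avro object container file: the `Obj` magic and the embedded schema in the cat output above give this away. Container files carry their own header, writer schema, and snappy-compressed blocks, so they should be opened with Avro's `DataFileReader`, which handles all of that itself. A sketch, assuming the byte arrays came from `sc.binaryFiles` as above (requires `org.apache.avro` on the classpath):

```scala
import org.apache.avro.file.{DataFileReader, SeekableByteArrayInput}
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import scala.collection.mutable.ArrayBuffer

// Parse one Avro container file held in memory as a byte array.
// DataFileReader reads the header, resolves the writer schema, and
// decompresses snappy blocks itself -- no manual BinaryDecoder needed.
def readContainer(bytes: Array[Byte]): Seq[GenericRecord] = {
  val reader = new DataFileReader[GenericRecord](
    new SeekableByteArrayInput(bytes),
    new GenericDatumReader[GenericRecord]())
  val records = ArrayBuffer.empty[GenericRecord]
  try {
    while (reader.hasNext) records += reader.next()
  } finally reader.close()
  records.toSeq
}

// Each record then exposes the fields from the embedded schema, e.g.
// record.get("schema").toString and record.get("data").toString.
```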
Note that you can start the Spark shell like this:
spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4
spark2-shell --packages org.apache.spark:spark-avro_2.11:2.4.4
Then you would do:
spark.read.format("com.databricks.spark.avro").load("/file/path")
For 2.3.x and earlier versions, then in your code:
val avro = spark.read.format("com.databricks.spark.avro").load("/path/")
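Once loaded, the top-level fields from the embedded schema become DataFrame columns, so the "schema" and "data" strings can be selected directly. A sketch, assuming a Spark 2.4.x session started with the `--packages` flag above (the path is a placeholder):

```scala
// In Spark 2.4+, spark-avro registers the short format name "avro".
val df = spark.read.format("avro").load("/file/path")
df.printSchema() // service, timestamp, count, encoderKey, schema, data

// Pull out just the two fields the question asks for.
val schemaAndData = df.select("schema", "data")
schemaAndData.show(truncate = false)
```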