
How to read Avro binary (Base64) encoded data in Spark Scala


I am trying to read an Avro file that is binary (Base64) encoded and Snappy-compressed. Running hadoop cat on the Avro file shows:

Objavro.schema? 
{"type":"record","name":"ConnectDefault","namespace":"xyz.connect.avro","fields": 
[{"name":"service","type":"string"},{"name":"timestamp","type":"long"}, 
{"name":"count","type":"int"},{"name":"encoderKey","type":{"type":"map","values":"string"}}, 
{"name":"schema","type":"string"},{"name":"data","type":"string"}]}>??n]
I need to extract and read the "schema" and "data" fields from the file above. The "schema" is shared by the "data" across multiple files.
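Since the dump above starts with the Avro container magic (`Obj` followed by the embedded `avro.schema` entry), the file is a standard Avro object container file, so its records can be read directly with Avro's `DataFileReader` without supplying a schema. A minimal sketch (the function name is my own, and the Avro and snappy-java jars are assumed to be on the classpath; Snappy decompression is handled transparently):

```scala
import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Reads every record of an Avro object container file and returns the
// (embedded schema, payload) pairs from its "schema" and "data" string fields.
def readSchemaAndData(avroFile: File): Seq[(String, String)] = {
  // The writer schema is stored in the file header, so none needs to be passed in.
  val reader = new DataFileReader[GenericRecord](
    avroFile, new GenericDatumReader[GenericRecord]())
  try {
    val out = scala.collection.mutable.ArrayBuffer[(String, String)]()
    while (reader.hasNext) {
      val record = reader.next()
      out += ((record.get("schema").toString, record.get("data").toString))
    }
    out.toSeq
  } finally reader.close()
}

// e.g. readSchemaAndData(new File("file+0+00724+00731.avro"))
```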

I tried the following steps:

1. Read the binary file

val binaryFilesRDD = sc.binaryFiles("file+0+00724+00731.avro").map { x => x._2.toArray }
binaryFilesRDD: org.apache.spark.rdd.RDD[Array[Byte]] = MapPartitionsRDD[1] at map at 
<console>:24
  • Called the following method with newArray (i.e. an Array[Byte]) to get the records from the bytes
  • But I got the following error:

        scala> val newDataRecords = getGenericRecordfromByte(newArray,inputDataSchema)
        org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -40
      at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
      at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
      at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
      at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:363)
      at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:355)
      at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:157)
      at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
      at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
      at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
      at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
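The "Length is negative" error is consistent with feeding a raw `BinaryDecoder` the bytes of a whole container file: `sc.binaryFiles` returns the entire file, including the `Obj...avro.schema` header and the Snappy-compressed blocks, which the decoder then misreads as record data. One way around this is to parse each file's bytes with `DataFileStream`, which understands the container format. A sketch under that assumption (the helper name is hypothetical):

```scala
import java.io.ByteArrayInputStream
import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Decodes the full bytes of one Avro object container file, e.g. one
// element of binaryFilesRDD. The header and compression codec are
// handled by DataFileStream, so no raw BinaryDecoder is needed.
def readContainerBytes(bytes: Array[Byte]): Seq[GenericRecord] = {
  val stream = new DataFileStream[GenericRecord](
    new ByteArrayInputStream(bytes),
    new GenericDatumReader[GenericRecord]())  // schema comes from the header
  try {
    val records = scala.collection.mutable.ArrayBuffer[GenericRecord]()
    while (stream.hasNext) records += stream.next()
    records.toSeq
  } finally stream.close()
}

// e.g. val recordsRDD = binaryFilesRDD.flatMap(readContainerBytes)
```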
    
    
    

    Please note:

    You can start the spark shell like this:

    spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4
    
    spark2-shell --packages org.apache.spark:spark-avro_2.11:2.4.4
    
    And then you would do:

    spark.read.format("com.databricks.spark.avro").load("/file/path")
    

    For Spark 2.3.x and earlier versions:

    Then, in your code:

    val avro = spark.read.format("com.databricks.spark.avro").load("/path/")
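
    With either version, once the file is loaded as a DataFrame the two fields of interest are ordinary string columns and can simply be projected out. A sketch, assuming `spark` is the session provided by spark-shell and the path is illustrative:

```scala
// Project the embedded "schema" and "data" fields out of the loaded DataFrame.
val avro = spark.read.format("com.databricks.spark.avro").load("/path/")
val schemaAndData = avro.select("schema", "data")
schemaAndData.show(truncate = false)
```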