
Spark 2.4.0 to_avro/from_avro deserialization does not work with Seq().toDF()


I am testing the brand-new from_avro and to_avro functions in Spark 2.4.0.

I created a DataFrame with a single column and three rows, serialized it to Avro, and then deserialized it back from Avro.

If the input dataset is created as

val input1 = Seq("foo", "bar", "baz").toDF("key")

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+
then deserialization returns only N copies of the last row:

+---+
|key|
+---+
|baz|
|baz|
|baz|
+---+
If I create the input dataset as

val input2 = input1.sqlContext.createDataFrame(input1.rdd, input1.schema)
then the result is correct:

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+
Sample code:

import org.apache.spark.sql.avro.{SchemaConverters, from_avro, to_avro}
import org.apache.spark.sql.DataFrame

val input1 = Seq("foo", "bar", "baz").toDF("key")
val input2 = input1.sqlContext.createDataFrame(input1.rdd, input1.schema)

def test_avro(df: DataFrame): Unit = {
  println("input df:")
  df.printSchema()
  df.show()

  val keySchema = SchemaConverters.toAvroType(df.schema).toString
  println(s"avro schema: $keySchema")

  val avroDf = df
    .select(to_avro($"key") as "key")

  println("avro serialized:")
  avroDf.printSchema()
  avroDf.show()

  val output = avroDf
    .select(from_avro($"key", keySchema) as "key")
    .select("key.*")

  println("avro deserialized:")
  output.printSchema()
  output.show()
}

println("############### testing .toDF()")
test_avro(input1)
println("############### testing .createDataFrame()")
test_avro(input2)
Results:

############### testing .toDF()
input df:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

avro schema: {"type":"record","name":"topLevelRecord","fields":[{"name":"key","type":["string","null"]}]}
avro serialized:
root
 |-- key: binary (nullable = true)

+----------------+
|             key|
+----------------+
|[00 06 66 6F 6F]|
|[00 06 62 61 72]|
|[00 06 62 61 7A]|
+----------------+

avro deserialized:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|baz|
|baz|
|baz|
+---+

############### testing .createDataFrame()
input df:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

avro schema: {"type":"record","name":"topLevelRecord","fields":[{"name":"key","type":["string","null"]}]}
avro serialized:
root
 |-- key: binary (nullable = true)

+----------------+
|             key|
+----------------+
|[00 06 66 6F 6F]|
|[00 06 62 61 72]|
|[00 06 62 61 7A]|
+----------------+

avro deserialized:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+
Judging from the tests, the problem appears to be in the deserialization stage, since printing the Avro-serialized DataFrame shows three distinct rows.
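The hex dumps above can be checked by hand to confirm that serialization produced distinct values. With the writer schema `["string","null"]`, each serialized cell is a zigzag-varint union branch index (`00` = the string branch), a zigzag-varint byte length (`06` decodes to 3), then the UTF-8 bytes. A minimal decoder sketch in plain Python (not Spark code, just illustrating the Avro binary encoding used in the dumps):

```python
def decode_avro_string_union(buf: bytes) -> str:
    """Decode one Avro value of union type ["string", "null"].

    Layout: zigzag-varint branch index (0 = string),
    then zigzag-varint byte length, then UTF-8 bytes.
    """
    def zigzag_varint(data: bytes, pos: int):
        # Avro ints are little-endian base-128 varints with zigzag encoding.
        shift, raw = 0, 0
        while True:
            b = data[pos]
            pos += 1
            raw |= (b & 0x7F) << shift
            if not (b & 0x80):
                break
            shift += 7
        return (raw >> 1) ^ -(raw & 1), pos

    branch, pos = zigzag_varint(buf, 0)
    assert branch == 0, "expected the string branch of the union"
    length, pos = zigzag_varint(buf, pos)
    return buf[pos:pos + length].decode("utf-8")

# The three serialized rows from the output above:
for raw in (b"\x00\x06foo", b"\x00\x06bar", b"\x00\x06baz"):
    print(decode_avro_string_union(raw))  # prints foo, bar, baz
```

So the bytes for `foo`, `bar`, and `baz` really are different; only from_avro collapses them.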


Am I doing something wrong, or is this a bug?

Looks like a bug. I filed a report, and it is now fixed in the 2.3 and 2.4 branches.

Thanks, I should have reported it myself.