Scala JSON to Avro to JSON
I am trying to convert a JSON file to Avro and back again. My input file is:
[
{
"userId": 1,
"firstName": "Krish",
"lastName": "Lee",
"phoneNumber": "123456",
"emailAddress": "krish.lee@abc.com"
},
{
"userId": 2,
"firstName": "racks",
"lastName": "jacson",
"phoneNumber": "123456",
"emailAddress": "racks.jacson@abc.com"
}
]
{"emailAddress":"krish.lee@abc.com","firstName":"Krish","lastName":"Lee","phoneNumber":"123456","userId":1}
{"emailAddress":"racks.jacson@abc.com","firstName":"racks","lastName":"jacson","phoneNumber":"123456","userId":2}
My output file is:
{"emailAddress":"krish.lee@abc.com","firstName":"Krish","lastName":"Lee","phoneNumber":"123456","userId":1}
{"emailAddress":"racks.jacson@abc.com","firstName":"racks","lastName":"jacson","phoneNumber":"123456","userId":2}
Below is my source code.

JSON to Avro:
val df = spark.read.option("multiLine", true).json("src\\main\\resources\\user.json")
df.printSchema()
df.show()
//convert to avro
df.write.mode("append").format("com.databricks.spark.avro").save("src\\main\\resources\\user1")
Avro to JSON:
val jsonDF = spark.read
.format("com.databricks.spark.avro").load("src\\main\\resources\\user")
jsonDF.show()
jsonDF.printSchema()
jsonDF.write.mode(SaveMode.Overwrite).json("src\\main\\resources\\output\\json")
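A side note on the format string: `com.databricks.spark.avro` is the old external Databricks package. Since Spark 2.4, Avro support ships with Spark itself as the `spark-avro` module, so the short format name `"avro"` works once the module is on the classpath. A minimal build.sbt sketch (the version shown is an assumption; align it with your Spark version):

```scala
// build.sbt fragment -- version is an assumption, match your Spark version
libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.5.0"
```

With that dependency in place, `df.write.format("avro").save(...)` and `spark.read.format("avro").load(...)` replace the Databricks format string in the code above.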
Can you check the code below?
scala> df
.select(to_json(collect_list(struct($"*"))).as("data"))
.write
.format("text") // Use the text format: the json writer would treat the array string as a column value and escape it again, giving wrong data.
.mode("overwrite")
.save("/tmp/datab/")
Input data:
scala> import sys.process._
scala> "cat /root/spark-examples/data.json".!
[
{
"userId": 1,
"firstName": "Krish",
"lastName": "Lee",
"phoneNumber": "123456",
"emailAddress": "krish.lee@abc.com"
},
{
"userId": 2,
"firstName": "racks",
"lastName": "jacson",
"phoneNumber": "123456",
"emailAddress": "racks.jacson@abc.com"
}
]
Load the JSON file content into a DataFrame:
scala> val df = spark
.read
.option("multiline","true")
.json("/root/spark-examples/data.json")
df: org.apache.spark.sql.DataFrame = [emailAddress: string, firstName: string ... 3 more fields]
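Notice that the inferred columns come back in alphabetical order (`emailAddress` first, `userId` last) rather than in file order: Spark's JSON schema inference sorts field names, which is also why the round-tripped output lists its keys alphabetically. The ordering itself is just a lexicographic sort of the key names, as this plain-Scala illustration (no Spark needed) shows:

```scala
object FieldOrder {
  // Spark's JSON schema inference sorts struct fields by name; the
  // resulting column order is a plain lexicographic sort of the keys.
  val fileOrder = Seq("userId", "firstName", "lastName", "phoneNumber", "emailAddress")
  val inferredOrder: Seq[String] = fileOrder.sorted

  def main(args: Array[String]): Unit = {
    println(inferredOrder.mkString(", "))
    // prints: emailAddress, firstName, lastName, phoneNumber, userId
  }
}
```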
Once the JSON file is loaded into a DataFrame, the array of objects is split into multiple rows (one row per object), as shown below:
scala> df.show(false)
+--------------------+---------+--------+-----------+------+
|emailAddress |firstName|lastName|phoneNumber|userId|
+--------------------+---------+--------+-----------+------+
|krish.lee@abc.com |Krish |Lee |123456 |1 |
|racks.jacson@abc.com|racks |jacson |123456 |2 |
+--------------------+---------+--------+-----------+------+
When you write the DataFrame back out as JSON, it is written as multiple lines, one object per line:
scala> df.repartition(1).write.mode("overwrite").json("/tmp/dataa/")
If you want the output to match the input data exactly, use the following code:
scala> df
.select(to_json(collect_list(struct($"*"))).as("data"))
.write
.format("text") // Use the text format: the json writer would treat the array string as a column value and escape it again, giving wrong data.
.mode("overwrite")
.save("/tmp/datab/")
What is your question? If you look at the input file, it is a list containing multiple objects; in the output file I was getting only individual objects, not a list.
scala> "cat /tmp/datab/part-00000-0896730e-51e1-4728-bd6b-cdfabc03978e-c000.txt".!
[
{"emailAddress":"krish.lee@abc.com","firstName":"Krish","lastName":"Lee","phoneNumber":"123456","userId":1},
{"emailAddress":"racks.jacson@abc.com","firstName":"racks","lastName":"jacson","phoneNumber":"123456","userId":2}
]
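The bracketed output above is just the JSON-lines payload wrapped in `[`, `]` with commas between records, which is what the `collect_list` + `to_json` trick produces on the cluster. For small outputs the same wrapping can be sketched off-cluster in plain Scala (the object and method names here are ours, not part of the answer, and this is only safe when every line is itself a valid JSON object):

```scala
object WrapJsonLines {
  // Wrap one-JSON-object-per-line records into a single JSON array string,
  // mirroring what collect_list(struct($"*")) + to_json does inside Spark.
  def wrap(lines: Seq[String]): String =
    lines.mkString("[\n", ",\n", "\n]")

  def main(args: Array[String]): Unit = {
    val records = Seq(
      """{"emailAddress":"krish.lee@abc.com","userId":1}""",
      """{"emailAddress":"racks.jacson@abc.com","userId":2}"""
    )
    println(wrap(records))
  }
}
```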