
How do I merge multiple SQL columns from an exploded JSON array in Apache Spark?


I am reading multiple JSON files from a directory; each JSON contains multiple items in a "cars" array. I am trying to explode the discrete values of the "car" items and merge them into a single DataFrame.

The JSON files look like this:

{
    "cars": {
        "items": 
            [
                {

                    "latitude": 42.0001,
                    "longitude": 19.0001,
                    "name": "Alex"
                },
                {

                    "latitude": 42.0002,
                    "longitude": 19.0002,
                    "name": "Berta"
                },
                {

                    "latitude": 42.0003,
                    "longitude": 19.0003,
                    "name": "Chris"
                },
                {

                    "latitude": 42.0004,
                    "longitude": 19.0004,
                    "name": "Diana"
                }
            ]
    }
}
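As an aside: pretty-printed JSON like the above spans several lines per document, while Spark's JSON reader by default expects one JSON object per line. On Spark 2.2+ there is a `multiLine` option for this case; a minimal sketch (the path below is a placeholder, not from the question):

```scala
// Hedged sketch, Spark 2.2+: read each whole file as a single JSON document.
// By default the JSON source assumes one JSON object per line.
// "/mnt/data/cars/" is a placeholder path.
val jsonData = sqlContext.read
  .option("multiLine", "true") // parse multi-line (pretty-printed) JSON
  .json("/mnt/data/cars/")
```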
My approach to exploding the values and merging them into a single DataFrame is as follows:

// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// Convert to a DataFrame
val jsonDF = jsonData.toDF()

/* Approach 1 */
// User-defined function to 'zip' two columns
val zip = udf((xs: Seq[Double], ys: Seq[Double]) => xs.zip(ys))
val zipped = jsonDF
  .withColumn("vars", explode(zip($"cars.items.latitude", $"cars.items.longitude")))
  .select($"cars.items.name", $"vars._1".alias("varA"), $"vars._2".alias("varB"))
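The UDF above relies on Scala's positional `Seq.zip`; a minimal plain-Scala sketch (no Spark involved) of why zipping before exploding keeps each coordinate pair aligned:

```scala
// Plain Scala, no Spark: zip pairs the i-th latitude with the i-th longitude,
// so exploding the zipped sequence later keeps each pair together.
val lats = Seq(42.0001, 42.0002, 42.0003)
val lons = Seq(19.0001, 19.0002, 19.0003)
val pairs = lats.zip(lons)
// pairs: Seq((42.0001,19.0001), (42.0002,19.0002), (42.0003,19.0003))
```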

/* Approach 2 */
val df = jsonData.select($"cars.items.name", $"cars.items.latitude", $"cars.items.longitude").toDF("name", "latitude", "longitude")
val df1 = df.select(explode(df("name")).alias("name"), df("latitude"), df("longitude"))
val df2 = df1.select(df1("name").alias("name"), explode(df1("latitude")).alias("latitude"), df1("longitude"))
val df3 = df2.select(df2("name"), df2("latitude"), explode(df2("longitude")).alias("longitude"))
As you can see, the result of Approach 1 is a DataFrame in which only the two discrete parameters are "merged", while the name column still holds the whole array:

+--------------------+---------+---------+
|                name|     varA|     varB|
+--------------------+---------+---------+
|[Leo, Britta, Gor...|48.161079|11.556778|
|[Leo, Britta, Gor...|48.124666|11.617682|
|[Leo, Britta, Gor...|48.352043|11.788091|
|[Leo, Britta, Gor...| 48.25184|11.636337|
+--------------------+---------+---------+
The result of Approach 2 is as follows:

+----+---------+---------+
|name| latitude|longitude|
+----+---------+---------+
| Leo|48.161079|11.556778|
| Leo|48.161079|11.617682|
| Leo|48.161079|11.788091|
| Leo|48.161079|11.636337|
| Leo|48.161079|11.560595|
| Leo|48.161079|11.788632|
+----+---------+---------+

(i.e. every "name" gets mapped to every "latitude" and every "longitude")

What I need instead is one row per item, like:

+--------------------+---------+---------+
|                name|     varA|     varB|
+--------------------+---------+---------+
|Leo                 |48.161079|11.556778|
|Britta              |48.124666|11.617682|
|Gorch               |48.352043|11.788091|
+--------------------+---------+---------+
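The blow-up in Approach 2 can be reproduced in plain Scala (no Spark): exploding independent array columns one after another behaves like nested loops, i.e. a Cartesian product:

```scala
// Plain Scala, no Spark: successive explodes over independent arrays
// behave like nested loops, producing a Cartesian product.
val names = Seq("Leo", "Britta", "Gorch")
val lats  = Seq(48.161079, 48.124666, 48.352043)
val crossed = for (n <- names; la <- lats) yield (n, la)
// 3 names x 3 latitudes = 9 rows instead of the 3 aligned rows we want
```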

Do you have any idea how to read the files, then split and merge the values so that each row holds just one object?


Thanks a lot for your help!

To get the expected result, you can try the following approach:

// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// Convert to a DataFrame
val jsonDF = jsonData.toDF()

// Explode the "items" array of structs once, then select its fields
val df1 = jsonDF.select(explode(jsonDF("cars.items")).alias("items"))
val df2 = df1.select("items.name", "items.latitude", "items.longitude")
The above approach will give you the following result:

+-----+--------+---------+
| name|latitude|longitude|
+-----+--------+---------+
| Alex| 42.0001|  19.0001|
|Berta| 42.0002|  19.0002|
|Chris| 42.0003|  19.0003|
|Diana| 42.0004|  19.0004|
+-----+--------+---------+
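What makes this work is that `explode` runs once over the array of structs, so each item's fields stay together. Roughly the same idea in plain Scala (no Spark), using a hypothetical `Item` record for illustration:

```scala
// Plain Scala analogy, no Spark: one pass over a list of records keeps each
// record's fields together, unlike exploding parallel arrays separately.
case class Item(name: String, latitude: Double, longitude: Double)
val items = Seq(Item("Alex", 42.0001, 19.0001), Item("Berta", 42.0002, 19.0002))
val rows = items.map(i => (i.name, i.latitude, i.longitude))
// rows: Seq(("Alex",42.0001,19.0001), ("Berta",42.0002,19.0002))
```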
