Apache Spark: How to collect rows into a map using groupBy?


Context

sqlContext.sql(s"""
SELECT
school_name,
name,
age
FROM my_table
""")
Question

Given the table above, I want to group by school name and collect name and age into a Map[String,Int].

For example (pseudocode):

val df = sqlContext.sql(s"""
SELECT
school_name,
age
FROM my_table
GROUP BY school_name
""")


------------------------
school_name | name  | age
------------------------
school A | "michael"| 7 
school A | "emily"  | 5
school B | "cathy"  | 10
school B | "shaun"  | 5


df.groupBy("school_name").agg(make_map)

------------------------------------
school_name | map
------------------------------------
school A    | {"michael": 7, "emily": 5}
school B    | {"cathy": 10, "shaun": 5}

The following works with Spark 2.0. You can use the map function, available since the 2.0 release, to get columns as a Map.

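import org.apache.spark.sql.functions.{col, collect_list, map}
// The $"..." column syntax needs spark.implicits._ (already imported in spark-shell)
val df1 = df.groupBy(col("school_name")).agg(collect_list(map($"name",$"age")) as "map")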
df1.show(false)
This gives you the following output:

+-----------+------------------------------------+
|school_name|map                                 |
+-----------+------------------------------------+
|school B   |[Map(cathy -> 10), Map(shaun -> 5)] |
|school A   |[Map(michael -> 7), Map(emily -> 5)]|
+-----------+------------------------------------+
Now you can use a UDF to join the individual maps into a single map, as follows:

import org.apache.spark.sql.functions.udf
val joinMap = udf { values: Seq[Map[String,Int]] => values.flatten.toMap }

val df2 = df1.withColumn("map", joinMap(col("map")))
df2.show(false)
This provides the required output as a Map[String,Int]:

+-----------+-----------------------------+
|school_name|map                          |
+-----------+-----------------------------+
|school B   |Map(cathy -> 10, shaun -> 5) |
|school A   |Map(michael -> 7, emily -> 5)|
+-----------+-----------------------------+
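To see what the UDF body does to each group, here is a minimal plain-Scala sketch with no Spark involved. The object name JoinMapsSketch and the sample values are only for illustration. Note that toMap keeps the last value when the same key appears more than once, so two students with the same name in one school would collapse into a single entry.

```scala
object JoinMapsSketch {
  // Same body as the joinMap UDF, written as an ordinary function
  def joinMaps(values: Seq[Map[String, Int]]): Map[String, Int] =
    values.flatten.toMap // flatten yields (name, age) pairs; toMap rebuilds one map

  def main(args: Array[String]): Unit = {
    // Distinct keys: the single-entry maps are merged into one map
    println(joinMaps(Seq(Map("cathy" -> 10), Map("shaun" -> 5))))
    // Duplicate key: the later entry overwrites the earlier one
    println(joinMaps(Seq(Map("emily" -> 5), Map("emily" -> 6))))
  }
}
```

If duplicate names within a group are possible in your data, you may prefer to collect a list of (name, age) pairs instead of a map.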
If you want to convert the column value to a JSON string, Spark 2.1.0 introduced the to_json function:

val df3 = df2.withColumn("map",to_json(struct($"map")))
df3.show(false)
The to_json function returns the following output:

+-----------+-------------------------------+
|school_name|map                            |
+-----------+-------------------------------+
|school B   |{"map":{"cathy":10,"shaun":5}} |
|school A   |{"map":{"michael":7,"emily":5}}|
+-----------+-------------------------------+

Starting with Spark 2.4, you can use the map_from_arrays function to achieve this:

val df = spark.sql(s"""
    SELECT *
    FROM VALUES ('s1','a',1),('s1','b',2),('s2','a',1)
    AS (school, name, age)
""")

val df2 = df.groupBy("school").agg(map_from_arrays(collect_list($"name"), collect_list($"age")).as("map"))



+------+----+---+
|school|name|age|
+------+----+---+
|    s1|   a|  1|
|    s1|   b|  2|
|    s2|   a|  1|
+------+----+---+

+------+----------------+
|school|             map|
+------+----------------+
|    s2|        [a -> 1]|
|    s1|[a -> 1, b -> 2]|
+------+----------------+
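For intuition, the per-group result above can be modelled in plain Scala by zipping the collected names with the collected ages. This is only a sketch of the idea, not Spark code: the Row case class and the collectToMap helper are hypothetical names, and it assumes both lists come out in the same row order.

```scala
object MapFromArraysSketch {
  // Hypothetical row type mirroring the (school, name, age) columns
  final case class Row(school: String, name: String, age: Int)

  // Group rows by school, then pair names with ages position by position
  def collectToMap(rows: Seq[Row]): Map[String, Map[String, Int]] =
    rows.groupBy(_.school).map { case (school, group) =>
      val names = group.map(_.name) // plays the role of collect_list($"name")
      val ages  = group.map(_.age)  // plays the role of collect_list($"age")
      school -> names.zip(ages).toMap
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row("s1", "a", 1), Row("s1", "b", 2), Row("s2", "a", 1))
    collectToMap(rows).foreach(println)
  }
}
```

In this plain-Scala model a repeated name within a school keeps the last age. Spark's own handling of duplicate map keys can differ by version and configuration, so check it separately if your data can contain duplicates.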

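While this code may answer the question, providing additional context about why and/or how it answers the question would improve its long-term value.

This doesn't seem to work. If you run printSchema on the result, the collection contains an array of strings, so you get only strings rather than maps. You can work around this by using a UDF to convert the array of strings into a map.

How can this be done in Java: val df1 = df.groupBy(col("school_name")).agg(collect_list(map($"name",$"age")) as "map")?

Isn't this available in PySpark yet?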