Scala Spark - grouping with Array[String] to match records contained as a list in other records' elements

In my Scala program, the problem I'm dealing with is how to combine the results from multiple levels of groupBy. The dataset I'm working with is quite large. As a small example, I have a DataFrame that looks like this:

// `toDF` and the $"col" syntax need `import spark.implicits._` from an
// active SparkSession (a minimal setup is sketched just below).
val df = Seq(
  ("f1", "l1", "loy1", null, "s1"),
  ("f1", "l1", "loy1", "e1", "s1"),
  ("f2", "l2", "loy2", "e2", "s2"),
  ("f2", "l2", "loy2", "e3", null),
  ("f1", "l1", null, "e1", "s3"),
  ("f1", "l1", null, "e2", "s3"),
  ("f2", "l2", null, null, "s4")
).toDF("F", "L", "Loy", "Email", "State")
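In spark-shell a session named spark already exists; for a standalone run, a minimal, illustrative setup (the app name is an assumption) would be:

import org.apache.spark.sql.SparkSession

// Illustrative local session; spark-shell already provides one as `spark`.
val spark = SparkSession.builder()
  .appName("multi-level-groupby")  // hypothetical name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._  // enables .toDF and the $"col" syntax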


+---+---+----+-----+-----+
|  F|  L| Loy|Email|State|
+---+---+----+-----+-----+
| f1| l1|loy1| null|   s1|
| f1| l1|loy1|   e1|   s1|
| f2| l2|loy2|   e2|   s2|
| f2| l2|loy2|   e3| null|
| f1| l1|null|   e1|   s3|
| f1| l1|null|   e2|   s3|
| f2| l2|null| null|   s4|
+---+---+----+-----+-----+
For the first-level groupBy, I use the following script, which produces results based on identical (F, L, Loy) columns:

import org.apache.spark.sql.functions.collect_set

df.groupBy("F", "L", "Loy")
  .agg(collect_set($"Email").alias("Email"),
       collect_set($"State").alias("State"))
  .show
The result is as follows:

+---+---+----+--------+-----+
|  F|  L| Loy|   Email|State|
+---+---+----+--------+-----+
| f1| l1|null|[e1, e2]| [s3]|
| f2| l2|loy2|[e2, e3]| [s2]|
| f1| l1|loy1|    [e1]| [s1]|
| f2| l2|null|      []| [s4]|
+---+---+----+--------+-----+
The problem I'm dealing with is how to perform the second-level groupBy, which is based on (F, L, Email), taking F and L as plain strings while the Email column is an Array[String]. This groupBy should return a result like this:

+---+---+------+--------+---------+
|  F|  L|   Loy|   Email|    State|
+---+---+------+--------+---------+
| f1| l1|[loy1]|[e1, e2]| [s3, s1]|
| f2| l2|[loy2]|[e2, e3]|     [s2]|
| f2| l2|  null|      []|     [s4]|
+---+---+------+--------+---------+

The main goal is to reduce the number of entries as much as possible by applying groupBy at different levels. I'm new to Scala and would appreciate any help :)

Comment thread: flagged as a possible duplicate; the asker is looking for a solution other than Graphframe.
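A minimal sketch of one way to get the desired shape without Graphframe, assuming Spark 2.4+ (for array_except, flatten and array_distinct): join each first-level row to the rows with the same F and L whose Email list contains its own, take the widest containing list as the second-level grouping key, and collapse on that key. This handles the containment case in the example above; chains of partially overlapping Email lists are really a connected-components problem and would need iteration or Graphframe.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// First-level grouping, as in the question.
val grouped = df.groupBy("F", "L", "Loy")
  .agg(collect_set($"Email").as("Email"), collect_set($"State").as("State"))

// Step 1: pair every row with the rows (same F, L) whose Email list contains
// its own. Empty lists are excluded so that the (f2, l2, []) row stays a
// group of its own.
val contained = grouped.as("a")
  .join(grouped.as("b"),
    $"a.F" === $"b.F" && $"a.L" === $"b.L" &&
      size($"a.Email") > 0 && size($"b.Email") > 0 &&
      size(array_except($"a.Email", $"b.Email")) === 0,
    "left_outer")

// Step 2: keep only the widest containing list per row (ties are broken
// arbitrarily), falling back to the row's own list when nothing matched.
val w = Window
  .partitionBy($"a.F", $"a.L", $"a.Loy", $"a.Email")
  .orderBy(size($"b.Email").desc)
val keyed = contained
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .withColumn("EmailKey", coalesce($"b.Email", $"a.Email"))

// Step 3: collapse everything that shares the same containing Email list.
val merged = keyed
  .groupBy($"a.F", $"a.L", $"EmailKey")
  .agg(collect_set($"a.Loy").as("Loy"),
       array_distinct(flatten(collect_set($"a.State"))).as("State"))
  .select($"F", $"L", $"Loy", col("EmailKey").as("Email"), $"State")

merged.show

On the example this yields the three desired rows, except that collect_set drops nulls, so the unmatched (f2, l2) row comes back with Loy [] rather than null.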