
Exploding multiple columns into separate rows in Spark Scala


I have a DF with the following structure:

Col1.                    Col2                    Col3
Data1Col1,Data2Col1.     Data1Col2,Data2Col2.    Data1Col3,Data2Col3
I want the resulting dataset to look like this:

Col1         Col2        Col3
Data1Col1.  Data1Col2.   Data1Col3
Data2Col1.  Data2Col2    Data2Col3
Please suggest how I should approach this. I have tried explode, but it results in duplicate rows.

// In spark-shell these imports are already in scope; in an application they are needed
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("C,D,E,F","M,N,O,P","K,P,B,P")).toDF("Col1","Col2","Col3")
   
df.show
+-------+-------+-------+
|   Col1|   Col2|   Col3|
+-------+-------+-------+
|C,D,E,F|M,N,O,P|K,P,B,P|
+-------+-------+-------+
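As a side note on the duplicated rows the question mentions: exploding each column independently takes a cross product of the values, so the counts multiply instead of staying aligned. A hypothetical naive attempt, shown only to illustrate the problem:

// Hypothetical naive attempt (not the solution): each explode multiplies
// the row count, so 4 values per column yield 4 * 4 * 4 = 64 combinations
// instead of 4 aligned rows.
val naive = df
  .withColumn("Col1", explode(split(col("Col1"), ",")))
  .withColumn("Col2", explode(split(col("Col2"), ",")))
  .withColumn("Col3", explode(split(col("Col3"), ",")))
naive.count // 64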
           
// Split each comma-separated string into an array column
val res1 = df
  .withColumn("Col1", split(col("Col1"), ","))
  .withColumn("Col2", split(col("Col2"), ","))
  .withColumn("Col3", split(col("Col3"), ","))
           
res1.show
+------------+------------+------------+
|        Col1|        Col2|        Col3|
+------------+------------+------------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|
+------------+------------+------------+
           
           
// UDF that zips the three arrays element-wise: each output element is (Col3, (Col1, Col2))
val zip = udf((x: Seq[String], y: Seq[String], z: Seq[String]) => z.zip(x.zip(y)))
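For intuition, the same zip on plain Scala collections with the sample row's values (this matches the test column shown below):

val x = Seq("C", "D", "E", "F")   // Col1
val y = Seq("M", "N", "O", "P")   // Col2
val z = Seq("K", "P", "B", "P")   // Col3
z.zip(x.zip(y))
// List((K,(C,M)), (P,(D,N)), (B,(E,O)), (P,(F,P)))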
           
// Explode the zipped array so each aligned triple gets its own row.
// Note: show returns Unit, so it must not be part of the assignment.
val res14 = res1.withColumn("test", explode(zip(col("Col1"), col("Col2"), col("Col3"))))
res14.show
+------------+------------+------------+-----------+
|        Col1|        Col2|        Col3|       test|
+------------+------------+------------+-----------+
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[K, [C, M]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [D, N]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[B, [E, O]]|
|[C, D, E, F]|[M, N, O, P]|[K, P, B, P]|[P, [F, P]]|
+------------+------------+------------+-----------+
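Spark encodes the tuples returned by the UDF as nested structs with fields _1 and _2, which is what the dotted column paths below navigate; res14.printSchema should show roughly this (array columns omitted):

res14.printSchema
// |-- test: struct (nullable = true)
// |    |-- _1: string (nullable = true)
// |    |-- _2: struct (nullable = true)
// |    |    |-- _1: string (nullable = true)
// |    |    |-- _2: string (nullable = true)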
           
       
res14.withColumn("t3",col("test._1")).withColumn("tn",col("test._2")).withColumn("t2",col("tn._2")).withColumn("t1",col("tn._1")).select("t1","t2","t3").show
+---+---+---+
| t1| t2| t3|
+---+---+---+
|  C|  M|  K|
|  D|  N|  P|
|  E|  O|  B|
|  F|  P|  P|
+---+---+---+
res1 – the dataframe with each column split into an array

res14 – the intermediate DF with the zipped column exploded into one row per element
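As an aside, on Spark 2.4+ the custom UDF can be avoided with the built-in arrays_zip, which zips array columns into an array of structs whose fields take the input column names; a minimal sketch under that assumption:

val res2 = res1
  .withColumn("z", explode(arrays_zip(col("Col1"), col("Col2"), col("Col3"))))
  .select(col("z.Col1").as("Col1"), col("z.Col2").as("Col2"), col("z.Col3").as("Col3"))
res2.show
// Same four aligned rows as the final result above, keeping the original column names.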

Why are there some stray dots in the sample data? Are they significant?