
Spark pivot within groups without aggregation [dataframe, apache-spark, pyspark, etl, rdd]


I have this dataframe:

+---+----+----+----+----+----+----+----+----+----+
| id|name|q1_w|q1_x|q1_y|q1_z|q2_w|q2_x|q2_y|q2_z|
+---+----+----+----+----+----+----+----+----+----+
|  1|AAAA|val1|val2|val3|val4|valw|valx|valy|valz|
|  2|BBBB|del1|del2|del3|del4|delw|delx|dely|delz|
|  3|CCCC|sol1|sol2|sol3|sol4|null|null|null|null|
+---+----+----+----+----+----+----+----+----+----+
If the number of columns in the source dataframe stays the same, you can get the required output with two separate transformations (select the relevant columns and rename them) followed by a union of the two dataframes.

// Source data creation (add import spark.implicits._ first if you are not in spark-shell)
val df = Seq(
  (1, "AAAA", "val1", "val2", "val3", "val4", "valw", "valx", "valy", "valz"),
  (2, "BBBB", "del1", "del2", "del3", "del4", "delw", "delx", "dely", "delz"),
  (3, "CCCC", "sol1", "sol2", "sol3", "sol4", null, null, null, null)
).toDF("id", "name", "q1_w", "q1_x", "q1_y", "q1_z", "q2_w", "q2_x", "q2_y", "q2_z")

// First dataframe: select the q1_* columns, drop rows with nulls, and rename
val df1 = df.select("id", "name", "q1_w", "q1_x", "q1_y", "q1_z")
  .filter($"q1_w".isNotNull && $"q1_x".isNotNull && $"q1_y".isNotNull && $"q1_z".isNotNull)
  .withColumnRenamed("q1_w", "w").withColumnRenamed("q1_x", "x")
  .withColumnRenamed("q1_y", "y").withColumnRenamed("q1_z", "z")

// Second dataframe: same treatment for the q2_* columns
val df2 = df.select("id", "name", "q2_w", "q2_x", "q2_y", "q2_z")
  .filter($"q2_w".isNotNull && $"q2_x".isNotNull && $"q2_y".isNotNull && $"q2_z".isNotNull)
  .withColumnRenamed("q2_w", "w").withColumnRenamed("q2_x", "x")
  .withColumnRenamed("q2_y", "y").withColumnRenamed("q2_z", "z")

// Union the first and second dataframes to get the required output
val finaldf = df1.union(df2)
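Since the question is tagged pyspark, the same select / filter / rename / union idea can also be sketched in PySpark. This is an untested sketch for illustration, not part of the original answer; it assumes df is the same input loaded as a PySpark DataFrame, and non_null_subset is just a hypothetical helper name:

from pyspark.sql import functions as F

def non_null_subset(df, cols, new_names):
    # keep id/name plus the given columns, drop rows where any of them is null,
    # then rename the value columns to the common names w, x, y, z
    out = df.select("id", "name", *cols)
    for c in cols:
        out = out.filter(F.col(c).isNotNull())
    return out.toDF("id", "name", *new_names)

new_names = ["w", "x", "y", "z"]
df1 = non_null_subset(df, ["q1_w", "q1_x", "q1_y", "q1_z"], new_names)
df2 = non_null_subset(df, ["q2_w", "q2_x", "q2_y", "q2_z"], new_names)
finaldf = df1.union(df2)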
You can see the output below:
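With the sample data above, finaldf.show() should produce roughly the following (the isNotNull filters drop the all-null q2 row for CCCC, and row order may differ):

+---+----+----+----+----+----+
| id|name|   w|   x|   y|   z|
+---+----+----+----+----+----+
|  1|AAAA|val1|val2|val3|val4|
|  2|BBBB|del1|del2|del3|del4|
|  3|CCCC|sol1|sol2|sol3|sol4|
|  1|AAAA|valw|valx|valy|valz|
|  2|BBBB|delw|delx|dely|delz|
+---+----+----+----+----+----+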

You can use the concept of melt, then split the column names on _ and pivot (note that pivot may be a bit expensive):


from pyspark.sql import functions as F

id_vars = ['id', 'name']
value_vars = [i for i in df.columns if i not in id_vars]
value_name = "Val"
var_name = "Var"

# Build an array of (column name, value) structs, one per q*_* column
_vars_and_vals = F.array(*(
    F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
    for c in value_vars))

# Add the array to the DataFrame and explode it (the "melt" step)
df1 = df.withColumn("_vars_and_vals", F.explode(_vars_and_vals))
cols = id_vars + [
    F.col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]

# Split the melted column names on "_", then pivot on the suffix
split_var = F.split("Var", "_")
out = (df1.select(*cols).withColumn("NewVar", split_var[1])
       .groupby(id_vars + [split_var[0].alias("q")])
       .pivot("NewVar")
       .agg(F.first("Val")))
out.show()

+---+----+---+----+----+----+----+
| id|name|  q|   w|   x|   y|   z|
+---+----+---+----+----+----+----+
|  1|AAAA| q1|val1|val2|val3|val4|
|  1|AAAA| q2|valw|valx|valy|valz|
|  2|BBBB| q1|del1|del2|del3|del4|
|  2|BBBB| q2|delw|delx|dely|delz|
|  3|CCCC| q1|sol1|sol2|sol3|sol4|
|  3|CCCC| q2|null|null|null|null|
+---+----+---+----+----+----+----+
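On the note that pivot may be a bit expensive: when the set of suffixes is known up front (here w, x, y, z), you can pass the values to pivot explicitly so Spark can skip the extra job it otherwise runs to collect the distinct pivot values. A small variation of the code above, not from the original answer:

out = (df1.select(*cols).withColumn("NewVar", split_var[1])
       .groupby(id_vars + [split_var[0].alias("q")])
       .pivot("NewVar", ["w", "x", "y", "z"])   # explicit pivot values
       .agg(F.first("Val")))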