Need help transforming a Spark DataFrame using DataFrame pivot
apache-spark, apache-spark-sql, pivot

I have the Spark DataFrame below, and I only need to pivot the Histogram.ratio and Histogram.abs entries in the name column:
Instance name Abs_value Ratio_value
A37 Histogram.ratio.1 0.70 Null
A37 Histogram.abs.1 20 Null
A37 Histogram.ratio.2 0.50 Null
A37 Histogram.abs.2 15 Null
A37 Mean 20 Null
A37 Min 3 Null
A37 Missingratio Null 3
Expected output:
Instance name Abs_value Ratio_value
A37 Histogram.1 20 0.70
A37 Histogram.2 15 0.50
A37 Mean 20 Null
A37 Min 3 Null
A37 Missingratio Null 3
I tried using dataframe.pivot together with a filter on the name column, but it did not work as expected. Any help is appreciated.

You can do some preprocessing to (1) create a new column type, set to Ratio_value when Ratio_value is not NULL or name contains .ratio., and to Abs_value otherwise, (2) strip .(ratio|abs). from the name column, and (3) combine the Abs_value and Ratio_value columns using coalesce, then run a regular pivot:
import org.apache.spark.sql.functions.{regexp_replace, coalesce, expr, first}

val df_new = df.select(
    'Instance,
    // collapse ".ratio." / ".abs." in name into a single dot
    regexp_replace('name, "[.](?:ratio|abs)[.]", ".") as 'name,
    // take Ratio_value when present, otherwise Abs_value
    coalesce('Ratio_value, 'Abs_value) as 'value,
    // decide which pivot column this row's value belongs to
    expr("IF(instr(name, '.ratio.') > 0 OR Ratio_value IS NOT NULL, 'Ratio_value', 'Abs_value') as type")
  )
  .groupBy('Instance, 'name)
  .pivot('type, Seq("Abs_value", "Ratio_value"))
  .agg(first('value))
df_new.show
+--------+------------+---------+-----------+
|Instance| name|Abs_value|Ratio_value|
+--------+------------+---------+-----------+
| A37| Histogram.1| 20| 0.70|
| A37| Min| 3| null|
| A37| Mean| 20| null|
| A37| Histogram.2| 15| 0.50|
| A37|Missingratio| null| 3|
+--------+------------+---------+-----------+
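The regex pattern [.](?:ratio|abs)[.] used above collapses the ratio/abs marker into a single dot. A quick plain-Python sketch (standard re module, outside Spark) of the same name normalization, using the names from the sample data:

```python
import re

# Same pattern Spark's regexp_replace uses above: a literal dot,
# then "ratio" or "abs", then another literal dot.
pattern = re.compile(r"[.](?:ratio|abs)[.]")

names = ["Histogram.ratio.1", "Histogram.abs.2", "Mean", "Missingratio"]
normalized = [pattern.sub(".", n) for n in names]
print(normalized)  # ['Histogram.1', 'Histogram.2', 'Mean', 'Missingratio']
```

Note that Missingratio is untouched: it contains "ratio" but not the dot-delimited ".ratio." marker, which is why the pattern anchors on both dots.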
Method 2: if the number of rows containing .ratio. or .abs. is small, pivot those rows separately and then union in the remaining rows:
import org.apache.spark.sql.functions.{regexp_replace, when, first}

val cond = 'name.contains(".ratio.") || 'name.contains(".abs.")

// pivot only the ".ratio." / ".abs." rows
val df1 = df.filter(cond)
  .select(
    'Instance,
    regexp_replace('name, "[.](ratio|abs)[.]", ".") as 'name,
    when('name.contains(".ratio."), "Ratio_value").otherwise("Abs_value") as 'type,
    'Abs_value as 'value)
  .groupBy('Instance, 'name)
  .pivot('type, Seq("Abs_value", "Ratio_value"))
  .agg(first('value))

// union back the untouched rows
val df_new = df.filter(!cond).union(df1)
Method 3: split the original DataFrame in two based on whether name contains the substring .ratio., then perform a full outer join:
import org.apache.spark.sql.functions.{regexp_replace, coalesce}
import org.apache.spark.sql.Column

val adjust_name = (c: Column) => regexp_replace(c, "[.](ratio|abs)[.]", ".")
val cond = 'name.contains(".ratio.")

// all non-ratio rows, with ".abs." collapsed in name
val df1 = df.filter(!cond).withColumn("name", adjust_name('name))
+--------+------------+---------+-----------+
|Instance| name|Abs_value|Ratio_value|
+--------+------------+---------+-----------+
| A37| Histogram.1| 20| null|
| A37| Histogram.2| 15| null|
| A37| Mean| 20| null|
| A37| Min| 3| null|
| A37|Missingratio| null| 3|
+--------+------------+---------+-----------+
val df2 = df.filter(cond).select('Instance, adjust_name('name) as 'name, 'Abs_value as 'Ratio_value1)
+--------+-----------+------------+
|Instance| name|Ratio_value1|
+--------+-----------+------------+
| A37|Histogram.1| 0.70|
| A37|Histogram.2| 0.50|
+--------+-----------+------------+
val df_new = df1.join(df2, Seq("Instance", "name"), "full")
  .select('Instance, 'name, 'Abs_value, coalesce('Ratio_value, 'Ratio_value1) as 'Ratio_value)
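For intuition, here is a minimal plain-Python sketch (hypothetical in-memory rows, not Spark) of what method 3's final full outer join plus coalesce computes; the values mirror the df1 and df2 tables shown above:

```python
# df1: (Instance, name) -> (Abs_value, Ratio_value)
rows1 = {
    ("A37", "Histogram.1"): (20, None),
    ("A37", "Histogram.2"): (15, None),
    ("A37", "Mean"): (20, None),
    ("A37", "Min"): (3, None),
    ("A37", "Missingratio"): (None, 3),
}
# df2: (Instance, name) -> Ratio_value1
rows2 = {
    ("A37", "Histogram.1"): 0.70,
    ("A37", "Histogram.2"): 0.50,
}

def coalesce(*vals):
    """First non-None value, like Spark's coalesce."""
    return next((v for v in vals if v is not None), None)

# Full outer join on the key: keep every key present on either side,
# then fill Ratio_value from whichever side has it.
joined = {}
for key in rows1.keys() | rows2.keys():
    abs_value, ratio_value = rows1.get(key, (None, None))
    joined[key] = (abs_value, coalesce(ratio_value, rows2.get(key)))

print(joined[("A37", "Histogram.1")])  # (20, 0.7)
```

Here every df2 key also exists in df1, so the "full" side only matters if a ratio row had no matching non-ratio row; the join type keeps such rows rather than dropping them.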