Apache Spark: need help transforming a Spark dataframe with dataframe pivot - Fatal编程技术网


I have the Spark dataframe below, and I need to pivot only the Histogram.ratio and Histogram.abs entries in the name column:

 Instance   name               Abs_value    Ratio_value  
 A37        Histogram.ratio.1    0.70         Null           
 A37        Histogram.abs.1      20           Null           
 A37        Histogram.ratio.2    0.50         Null           
 A37        Histogram.abs.2      15           Null           
 A37        Mean                 20           Null           
 A37        Min                  3            Null           
 A37        Missingratio         Null           3
Expected output:

 Instance   name               Abs_value    Ratio_value  
 A37        Histogram.1          20           0.70                  
 A37        Histogram.2          15           0.50           
 A37        Mean                 20           Null           
 A37        Min                  3            Null           
 A37        Missingratio         Null           3

I tried using dataframe.pivot combined with a filter on the name column, but it did not work as expected. Any help is appreciated.
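For reference, the input dataframe used in the answers below can be reconstructed like this. This is a minimal sketch under assumptions not in the original post: a local SparkSession is created here (in spark-shell one already exists as `spark`), and the value columns are modeled as strings because Abs_value mixes integers, decimals, and NULLs:

```scala
import org.apache.spark.sql.SparkSession

// hypothetical local session for experimenting; adjust master/appName as needed
val spark = SparkSession.builder.master("local[*]").appName("pivot-demo").getOrCreate()
import spark.implicits._

// None becomes a SQL NULL in the resulting dataframe
val df = Seq(
  ("A37", "Histogram.ratio.1", Some("0.70"), None),
  ("A37", "Histogram.abs.1",   Some("20"),   None),
  ("A37", "Histogram.ratio.2", Some("0.50"), None),
  ("A37", "Histogram.abs.2",   Some("15"),   None),
  ("A37", "Mean",              Some("20"),   None),
  ("A37", "Min",               Some("3"),    None),
  ("A37", "Missingratio",      None,         Some("3"))
).toDF("Instance", "name", "Abs_value", "Ratio_value")
```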

You can do some preprocessing to (1) create a new column type, set to Ratio_value when name contains .ratio. or Ratio_value is not NULL, and to Abs_value otherwise, (2) strip the .(ratio|abs). substring from the name column, (3) merge the Abs_value and Ratio_value columns with the coalesce function, and then perform a regular pivot:

import org.apache.spark.sql.functions.{regexp_replace, coalesce, expr, first}

val df_new = df.select(
      'Instance,
      // collapse ".ratio." / ".abs." so both rows share the same pivoted name
      regexp_replace('name, "[.](?:ratio|abs)[.]", ".") as 'name,
      // the single value to spread across the pivoted columns
      coalesce('Ratio_value, 'Abs_value) as 'value,
      // decide which pivoted column this row's value belongs to; `name` and
      // `Ratio_value` here still refer to the input columns
      expr("IF(instr(name, '.ratio.') > 0 OR Ratio_value IS NOT NULL, 'Ratio_value', 'Abs_value') as type")
  )
  .groupBy('Instance, 'name)
  .pivot('type, Seq("Abs_value", "Ratio_value")) // Column-typed pivot needs Spark 2.4+
  .agg(first('value))

df_new.show
+--------+------------+---------+-----------+                                   
|Instance|        name|Abs_value|Ratio_value|
+--------+------------+---------+-----------+
|     A37| Histogram.1|       20|       0.70|
|     A37|         Min|        3|       null|
|     A37|        Mean|       20|       null|
|     A37| Histogram.2|       15|       0.50|
|     A37|Missingratio|     null|          3|
+--------+------------+---------+-----------+
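A side note on the pattern itself: Spark's regexp_replace follows Java regex semantics, so the rewrite can be sanity-checked outside Spark with plain String.replaceAll:

```scala
// the same pattern used above: a literal dot, then "ratio" or "abs",
// then another literal dot, all replaced by a single dot
val pattern = "[.](?:ratio|abs)[.]"

println("Histogram.ratio.1".replaceAll(pattern, ".")) // Histogram.1
println("Histogram.abs.2".replaceAll(pattern, "."))   // Histogram.2
// names without the dotted infix pass through untouched
println("Missingratio".replaceAll(pattern, "."))      // Missingratio
```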
Method 2: if only a small number of rows contain .ratio. or .abs., handle those rows separately with a pivot and then union the remaining rows back in:

import org.apache.spark.sql.functions.{regexp_replace, when, first}

val cond = 'name.contains(".ratio.") || 'name.contains(".abs.")
val df1 = df.filter(cond)
    .select(
      'Instance,
      regexp_replace('name, "[.](ratio|abs)[.]", ".") as 'name,
      // expressions in one select all resolve against the input, so this
      // `when` still sees the original, un-replaced name
      when('name.contains(".ratio."), "Ratio_value").otherwise("Abs_value") as 'type,
      // both ratio and abs rows carry their value in Abs_value
      'Abs_value as 'value)
    .groupBy('Instance, 'name)
    .pivot('type, Seq("Abs_value", "Ratio_value"))
    .agg(first('value))

val df_new = df.filter(!cond).union(df1) // union is positional; df1's columns match df's order
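One caveat on that last line: union matches columns by position, not by name, so method 2 relies on the pivot emitting Abs_value and Ratio_value in the same order as the original dataframe. unionByName (available since Spark 2.3) removes that dependency. A standalone sketch, again assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

// hypothetical local session for experimenting
val spark = SparkSession.builder.master("local[*]").appName("union-demo").getOrCreate()
import spark.implicits._

val a = Seq(("A37", 20)).toDF("Instance", "Abs_value")
val b = Seq((15, "A37")).toDF("Abs_value", "Instance") // same columns, swapped order

// unionByName aligns columns by name, so the swapped order is harmless
val combined = a.unionByName(b)
combined.show
```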
Method 3: split the original dataframe in two based on whether name contains the substring .ratio., then perform a full outer join:

import org.apache.spark.sql.functions.{regexp_replace,coalesce}
import org.apache.spark.sql.Column

val adjust_name = (c: Column) => regexp_replace(c, "[.](ratio|abs)[.]", ".")
val cond = 'name.contains(".ratio.")

// rows without ".ratio." (the ".abs." rows keep their value, with name adjusted)
val df1 = df.filter(!cond).withColumn("name", adjust_name('name))
df1.show
+--------+------------+---------+-----------+
|Instance|        name|Abs_value|Ratio_value|
+--------+------------+---------+-----------+
|     A37| Histogram.1|       20|       null|
|     A37| Histogram.2|       15|       null|
|     A37|        Mean|       20|       null|
|     A37|         Min|        3|       null|
|     A37|Missingratio|     null|          3|
+--------+------------+---------+-----------+

// ratio rows store their value in Abs_value, so rename it while splitting off
val df2 = df.filter(cond).select('Instance, adjust_name('name) as 'name, 'Abs_value as 'Ratio_value1)
df2.show
+--------+-----------+------------+
|Instance|       name|Ratio_value1|
+--------+-----------+------------+
|     A37|Histogram.1|        0.70|
|     A37|Histogram.2|        0.50|
+--------+-----------+------------+

val df_new = df1.join(df2, Seq("Instance", "name"), "full")
    .select('Instance, 'name, 'Abs_value,
      // fold the joined ratio values back into a single Ratio_value column
      coalesce('Ratio_value, 'Ratio_value1) as 'Ratio_value)