Need help transforming a Spark DataFrame using DataFrame pivot
apache-spark, apache-spark-sql, pivot

I have the Spark DataFrame below, and I only need to pivot the Histogram.ratio and Histogram.abs entries in the name column:
Instance name Abs_value Ratio_value
A37 Histogram.ratio.1 0.70 Null
A37 Histogram.abs.1 20 Null
A37 Histogram.ratio.2 0.50 Null
A37 Histogram.abs.2 15 Null
A37 Mean 20 Null
A37 Min 3 Null
A37 Missingratio Null 3
Expected output:
Instance name Abs_value Ratio_value
A37 Histogram.1 20 0.70
A37 Histogram.2 15 0.50
A37 Mean 20 Null
A37 Min 3 Null
A37 Missingratio Null 3
I tried using dataframe.pivot together with a filter on the name column, but it did not work as expected. Any help is appreciated.

You can do some preprocessing to (1) create a new column type, set to Ratio_value when Ratio_value is not NULL or name contains .ratio., and to Abs_value otherwise, (2) strip .(ratio|abs). from the name column, and (3) combine the Abs_value and Ratio_value columns using coalesce, then run a regular pivot:
import org.apache.spark.sql.functions.{regexp_replace, coalesce, expr, first}

val df_new = df.select(
    'Instance,
    // collapse ".ratio." / ".abs." in name into a single dot
    regexp_replace('name, "[.](?:ratio|abs)[.]", ".") as 'name,
    // take Ratio_value when present, otherwise Abs_value
    coalesce('Ratio_value, 'Abs_value) as 'value,
    // decide which pivot column this row's value belongs to
    expr("IF(instr(name, '.ratio.') > 0 OR Ratio_value IS NOT NULL, 'Ratio_value', 'Abs_value') as type")
  )
  .groupBy('Instance, 'name)
  .pivot('type, Seq("Abs_value", "Ratio_value"))
  .agg(first('value))
df_new.show
+--------+------------+---------+-----------+
|Instance| name|Abs_value|Ratio_value|
+--------+------------+---------+-----------+
| A37| Histogram.1| 20| 0.70|
| A37| Min| 3| null|
| A37| Mean| 20| null|
| A37| Histogram.2| 15| 0.50|
| A37|Missingratio| null| 3|
+--------+------------+---------+-----------+
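The regex pattern [.](?:ratio|abs)[.] used above collapses the ratio/abs marker into a single dot. A quick plain-Python sketch (standard re module, outside Spark) of the same name normalization, using the names from the sample data:

```python
import re

# Same pattern Spark's regexp_replace uses above: a literal dot,
# then "ratio" or "abs", then another literal dot.
pattern = re.compile(r"[.](?:ratio|abs)[.]")

names = ["Histogram.ratio.1", "Histogram.abs.2", "Mean", "Missingratio"]
normalized = [pattern.sub(".", n) for n in names]
print(normalized)  # ['Histogram.1', 'Histogram.2', 'Mean', 'Missingratio']
```

Note that Missingratio is untouched: it contains "ratio" but not the dot-delimited ".ratio." marker, which is why the pattern anchors on both dots.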
Method 2: if the number of rows containing .ratio. or .abs. is small, pivot those rows separately and then union in the remaining rows:
import org.apache.spark.sql.functions.{regexp_replace, when, first}

val cond = 'name.contains(".ratio.") || 'name.contains(".abs.")

// pivot only the ".ratio." / ".abs." rows
val df1 = df.filter(cond)
  .select(
    'Instance,
    regexp_replace('name, "[.](ratio|abs)[.]", ".") as 'name,
    when('name.contains(".ratio."), "Ratio_value").otherwise("Abs_value") as 'type,
    'Abs_value as 'value)
  .groupBy('Instance, 'name)
  .pivot('type, Seq("Abs_value", "Ratio_value"))
  .agg(first('value))

// union back the untouched rows
val df_new = df.filter(!cond).union(df1)
Method 3: split the original DataFrame in two based on whether name contains the substring .ratio., then perform a full outer join:
import org.apache.spark.sql.functions.{regexp_replace, coalesce}
import org.apache.spark.sql.Column

val adjust_name = (c: Column) => regexp_replace(c, "[.](ratio|abs)[.]", ".")
val cond = 'name.contains(".ratio.")

// all non-ratio rows, with ".abs." collapsed in name
val df1 = df.filter(!cond).withColumn("name", adjust_name('name))
+--------+------------+---------+-----------+
|Instance| name|Abs_value|Ratio_value|
+--------+------------+---------+-----------+
| A37| Histogram.1| 20| null|
| A37| Histogram.2| 15| null|
| A37| Mean| 20| null|
| A37| Min| 3| null|
| A37|Missingratio| null| 3|
+--------+------------+---------+-----------+
val df2 = df.filter(cond).select('Instance, adjust_name('name) as 'name, 'Abs_value as 'Ratio_value1)
+--------+-----------+------------+
|Instance| name|Ratio_value1|
+--------+-----------+------------+
| A37|Histogram.1| 0.70|
| A37|Histogram.2| 0.50|
+--------+-----------+------------+
val df_new = df1.join(df2, Seq("Instance", "name"), "full")
  .select('Instance, 'name, 'Abs_value, coalesce('Ratio_value, 'Ratio_value1) as 'Ratio_value)
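For intuition, here is a minimal plain-Python sketch (hypothetical in-memory rows, not Spark) of what method 3's final full outer join plus coalesce computes; the values mirror the df1 and df2 tables shown above:

```python
# df1: (Instance, name) -> (Abs_value, Ratio_value)
rows1 = {
    ("A37", "Histogram.1"): (20, None),
    ("A37", "Histogram.2"): (15, None),
    ("A37", "Mean"): (20, None),
    ("A37", "Min"): (3, None),
    ("A37", "Missingratio"): (None, 3),
}
# df2: (Instance, name) -> Ratio_value1
rows2 = {
    ("A37", "Histogram.1"): 0.70,
    ("A37", "Histogram.2"): 0.50,
}

def coalesce(*vals):
    """First non-None value, like Spark's coalesce."""
    return next((v for v in vals if v is not None), None)

# Full outer join on the key: keep every key present on either side,
# then fill Ratio_value from whichever side has it.
joined = {}
for key in rows1.keys() | rows2.keys():
    abs_value, ratio_value = rows1.get(key, (None, None))
    joined[key] = (abs_value, coalesce(ratio_value, rows2.get(key)))

print(joined[("A37", "Histogram.1")])  # (20, 0.7)
```

Here every df2 key also exists in df1, so the "full" side only matters if a ratio row had no matching non-ratio row; the join type keeps such rows rather than dropping them.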