How do I compress multiple rows into a single row in Apache Spark?
Environment: Spark 2.4.5. Source: test.csv. Target: test.csv. As you can see, I want to merge rows that share the same id and date into a single row.

My attempt: I tried to handle it with the arrays_zip function, but it failed:
val source = spark.read.option("header", "true").csv("/home/user/test.csv")
source.createOrReplaceTempView("source")
spark.sql("SELECT id , date, arrays_zip( collect_list(item1), collect_list(item2), collect_list(item3)) FROM source GROUP BY id,date").show(false)
+---+----+-------------------------------------------------------------------------+
|id |date|arrays_zip(collect_list(item1), collect_list(item2), collect_list(item3))|
+---+----+-------------------------------------------------------------------------+
|0 |1 |[[111, 222, 333]] |
|1 |1 |[[111, 222, 333]] |
+---+----+-------------------------------------------------------------------------+
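(The post's source CSV is not shown above; judging from the collect_list sizes in this output, it presumably looks something like the following, with one non-null item per row. This is purely an assumption:)

id,date,item1,item2,item3
0,1,111,,
0,1,,222,
0,1,,,333
1,1,111,,
1,1,,222,
1,1,,,333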
Maybe I should explode this array into columns?
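(For what it's worth, that idea can work: below is a sketch using Spark's inline generator, which expands an array<struct> into one row per element with one column per struct field. The output columns take the struct's field names, likely 0/1/2 for arrays_zip over expressions, so rename them as needed.)

spark.sql("""
  SELECT id, date, inline(z)
  FROM (
    SELECT id, date,
           arrays_zip(collect_list(item1), collect_list(item2), collect_list(item3)) AS z
    FROM source
    GROUP BY id, date
  ) t
""").show(false)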
I would appreciate any advice.

Answer: there are two options.

1. Use flatten and array instead of arrays_zip, then use the element_at function to get the items out of the flattened array.
2. Use groupBy and first(col, ignoreNulls = true). Starting with option 2:
import org.apache.spark.sql.functions._
val df = spark.read.option("header", "true").csv("/home/user/test.csv")
df.groupBy(col("id"), col("date")).
  agg(first(col("item1"), true).alias("item1"), first(col("item2"), true).alias("item2"), first(col("item3"), true).alias("item3")).
  show()
//+---+----+-----+-----+-----+
//| id|date|item1|item2|item3|
//+---+----+-----+-----+-----+
//| 0| 1| 111| 222| 333|
//| 1| 1| 111| 222| 333|
//+---+----+-----+-----+-----+
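(A quick aside on ignoreNulls, with made-up demo data: without it, first() can pick up a null sitting on whichever row the group happens to see first.)

import spark.implicits._
val demo = Seq((0, 1, None: Option[Int]), (0, 1, Some(111))).toDF("id", "date", "item1")
demo.groupBy("id", "date").
  agg(first(col("item1")).alias("naive"), first(col("item1"), ignoreNulls = true).alias("safe")).
  show()
// "safe" is always 111; "naive" may come back null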
SQL:
df.createOrReplaceTempView("tmp")
//using first
spark.sql("select id,date,first(item1,true) as item1,first(item2,true) as item2,first(item3,true) as item3 from tmp group by id,date").show()
//using max
spark.sql("select id,date,max(item1) as item1,max(item2) as item2,max(item3) as item3 from tmp group by id,date").show()
//using flatten + array + element_at (works here because each group has exactly one non-null value per item column)
spark.sql("select id, date, element_at(tmp,1) as item1, element_at(tmp,2) as item2, element_at(tmp,3) as item3 from (select id, date, flatten(array(collect_list(item1), collect_list(item2), collect_list(item3))) as tmp from tmp group by id, date) t").show()
//+---+----+-----+-----+-----+
//| id|date|item1|item2|item3|
//+---+----+-----+-----+-----+
//| 0| 1| 111| 222| 333|
//| 1| 1| 111| 222| 333|
//+---+----+-----+-----+-----+
Dynamic way:
val df = spark.read.option("header", "true").csv("/home/user/test.csv")
val df1 = df.groupBy(col("id"), col("date")).agg(flatten(array(collect_list(col("item1")), collect_list(col("item2")), collect_list(col("item3")))).alias("it"))
// the widest group determines how many itemN columns to create
val len = df1.agg(max(size(col("it")))).collect()(0)(0).toString.toInt
// add one column per array position, then drop the helper array
spark.range(len).collect().foldLeft(df1)((acc, i) => acc.withColumn(s"item${i + 1}", col("it")(i))).
  drop("it").
  show()
//+---+----+-----+-----+-----+
//| id|date|item1|item2|item3|
//+---+----+-----+-----+-----+
//| 0| 1| 111| 222| 333|
//| 1| 1| 111| 222| 333|
//+---+----+-----+-----+-----+
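(A side note on that fold: spark.range(len).collect() runs a Spark job just to materialize the indexes, but len is already a plain Int on the driver, so an ordinary local range does the same thing. A sketch:)

val widened = (0 until len).foldLeft(df1)((acc, i) => acc.withColumn(s"item${i + 1}", col("it")(i)))
widened.drop("it").show()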
Thank you for the detailed answer!