Exploding multiple array columns in Apache Spark / Spark SQL
How can I explode multiple array columns in Spark? I have a DataFrame with 5 columns of stringified arrays, and I want to explode on all 5. For simplicity, the example below uses 3 columns. Given the following input row:
col1 col2 col3
["b_val1","b_val2"] ["at_val1","at_val2","at_val3"] ["male","female"]
I want to explode on all 3 array columns, so the output should look like this:
b_val1 at_val1 male
b_val1 at_val1 female
b_val2 at_val1 male
b_val2 at_val1 female
b_val1 at_val2 male
b_val1 at_val2 female
b_val2 at_val2 male
b_val2 at_val2 female
b_val1 at_val3 male
b_val1 at_val3 female
b_val2 at_val3 male
b_val2 at_val3 female
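Exploding three independent array columns multiplies the rows: the result is the Cartesian product of the three arrays (2 × 3 × 2 = 12 rows here). A plain-Scala sketch of that semantics, with no Spark involved:

```scala
// Cartesian product of the three arrays: this is what exploding
// all three columns should produce (values taken from the example above).
val col1 = Seq("b_val1", "b_val2")
val col2 = Seq("at_val1", "at_val2", "at_val3")
val col3 = Seq("male", "female")

// A for-comprehension over three sequences yields every combination.
val rows = for (a <- col1; b <- col2; c <- col3) yield (a, b, c)
println(rows.size) // 2 * 3 * 2 = 12
```

This is exactly what the chained `explode` calls below compute, one column at a time.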
I tried the following:
SELECT
timestamp,
explode(from_json(brandList, 'array<string>')) AS brand,
explode(from_json(articleTypeList, 'array<string>')) AS articleTypeList,
explode(from_json(gender, 'array<string>')) AS gender,
explode(from_json(masterCategoryList, 'array<string>')) AS masterCategoryList,
explode(from_json(subCategoryList, 'array<string>')) AS subCategoryList,
isLandingPage,
...
from table
But this is not allowed; it fails with the following error in thread "main":
org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 5: explode(jsontostructs(brandList)), explode(jsontostructs(articleTypeList)), explode(jsontostructs(gender)), explode(jsontostructs(masterCategoryList)), explode(jsontostructs(subCategoryList));
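One workaround for this restriction (a sketch, assuming the table and column names from the query above) is to nest the SELECTs so that each level contains exactly one generator:

```sql
-- One explode per nested SELECT; shown for 3 of the 5 columns.
SELECT brand, articleType, gender FROM (
  SELECT brand, articleType,
         explode(from_json(gender, 'array<string>')) AS gender
  FROM (
    SELECT brand,
           explode(from_json(articleTypeList, 'array<string>')) AS articleType,
           gender
    FROM (
      SELECT explode(from_json(brandList, 'array<string>')) AS brand,
             articleTypeList, gender
      FROM table
    )
  )
)
```

Each inner query produces a flat column that the next level can carry through, so no SELECT ever holds more than one generator.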
Getting the desired output using withColumn
Let's create a sample DataFrame with 3 ArrayType columns and explode them:
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val rdd = spark.sparkContext.makeRDD(List(Row(Array(1, 2, 3), Array("a", "b", "c"), Array("1a", "1b", "1c"))))
val schema = new StructType()
  .add("arraycolumn1", ArrayType(IntegerType))
  .add("arraycolumn2", ArrayType(StringType))
  .add("arraycolumn3", ArrayType(StringType))
val df = spark.createDataFrame(rdd, schema)
df.show(5,false)
+------------+------------+------------+
|arraycolumn1|arraycolumn2|arraycolumn3|
+------------+------------+------------+
|[1, 2, 3] |[a, b, c] |[1a, 1b, 1c]|
+------------+------------+------------+
val explodedDF = df
  .withColumn("column1", explode('arraycolumn1))
  .withColumn("column2", explode('arraycolumn2))
  .withColumn("column3", explode('arraycolumn3))
explodedDF.select('column1, 'column2, 'column3).show(5, false)
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|1 |a |1a |
|1 |a |1b |
|1 |a |1c |
|1 |b |1a |
|1 |b |1b |
+-------+-------+-------+
only showing top 5 rows
// Let's do the above steps with less lines of code
val exploded = df.columns.foldLeft(df)((df, column) => df.withColumn(column, explode(col(column))))
exploded.select(df.columns.map(col(_)):_*).show(false)
//using spark-sql
df.createOrReplaceTempView("arrayTable")
spark.sql("""
select column1,column2,column3 from arraytable
LATERAL VIEW explode(arraycolumn1) as column1
LATERAL VIEW explode(arraycolumn2) as column2
LATERAL VIEW explode(arraycolumn3) as column3""").show
Is there any way to do this in SQL? I don't have the flexibility to write Scala code. — Yes, we can do it with LATERAL VIEW; I've edited the post, please take a look.
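For reference, the same LATERAL VIEW approach applied to the original question's stringified columns might look like this (a sketch; the table name and the `from_json` schemas are assumed from the question):

```sql
SELECT brand, articleType, gender
FROM table
LATERAL VIEW explode(from_json(brandList, 'array<string>')) AS brand
LATERAL VIEW explode(from_json(articleTypeList, 'array<string>')) AS articleType
LATERAL VIEW explode(from_json(gender, 'array<string>')) AS gender
```

Each LATERAL VIEW joins the rows produced so far with one exploded array, so the result is the full Cartesian product without ever placing two generators in the same SELECT clause.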