Apache Spark: UDF operating on an array column


I have a Spark DataFrame like this:

+-------------+------------------------------------------+
|a            |destination                               |
+-------------+------------------------------------------+
|[a,Alice,1]  |[[b,Bob,0], [e,Esther,0], [h,Fraudster,1]]|
|[e,Esther,0] |[[f,Fanny,0], [d,David,0]]                |
|[c,Charlie,0]|[[b,Bob,0]]                               |
|[b,Bob,0]    |[[c,Charlie,0]]                           |
|[f,Fanny,0]  |[[c,Charlie,0], [h,Fraudster,1]]          |
|[d,David,0]  |[[a,Alice,1], [e,Esther,0]]               |
+-------------+------------------------------------------+

where destination has the schema:

 |-- destination: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- var_only_0_and_1: integer (nullable = false)

How can I construct a UDF that operates on the column destination, i.e. the wrapped array created by Spark's collect_list function, to compute the mean of the variable var_only_0_and_1?

You can do this with native Spark SQL functions:

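import org.apache.spark.sql.functions.{avg, col, explode}
// one row per array element, then average the field per original key
df.withColumn("dest", explode(col("destination")))
  .groupBy("a").agg(avg(col("dest").getField("var_only_0_and_1")))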

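A UDF can operate directly on an array, as long as its method signature is right (this has bitten me badly in the past). An array column is visible to the UDF as a Seq and a struct as a Row, so you need something like this: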

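import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

def test(in: Seq[Row]): Int = {
  // return a named field from the first struct in the array;
  // the field is an integer in the schema, so read it as Int
  in.head.getAs[Int]("var_only_0_and_1")
}

val udftest = udf(test _)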

I have tested this on data that looks like yours. I would guess you can iterate over the Seq[Row] and its fields to get the result you want.
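The iteration mentioned above might look roughly like the following. This is only a sketch, meant to be pasted into spark-shell: the names mean and meanVar are made up for illustration, df is assumed to be the DataFrame from the question, and the parameter is typed as scala.collection.Seq on the assumption that this keeps the signature valid on both Scala 2.12 and 2.13 builds of Spark.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: mean of a sequence of ints, None when it is empty.
def mean(xs: scala.collection.Seq[Int]): Option[Double] =
  if (xs.isEmpty) None else Some(xs.sum.toDouble / xs.size)

// Hypothetical UDF: Spark hands array<struct<...>> to the function as a Seq of Row.
// Returning an Option makes the result column nullable.
val meanVar = udf { (rows: scala.collection.Seq[Row]) =>
  if (rows == null) None
  else mean(rows.filter(_ != null).map(_.getAs[Int]("var_only_0_and_1")))
}

// Usage, assuming df is the DataFrame from the question:
// df.withColumn("mean_var", meanVar(col("destination")))
```

Keeping the arithmetic in a plain function makes it easy to check without a cluster, while the UDF wrapper only deals with nulls and field extraction.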

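Honestly, I am not at all sure about the type safety of doing this, and I believe explode is the best way, as @ayplam said. Built-in functions are generally faster than any custom function a developer supplies, because Spark cannot optimize custom functions.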
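If both the UDF and the explode plus groupBy round trip are a concern, Spark 2.4 and later also ship higher-order SQL functions that work on the array in place. The sketch below uses the built-in aggregate and size functions through expr; it is not from the original answers. The Person case class and the two-row sample are invented stand-ins for the question's data, with the key column a simplified to a plain string.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[1]").appName("array-mean").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the struct in the question's schema.
case class Person(id: String, name: String, var_only_0_and_1: Int)

val df = Seq(
  ("a", Seq(Person("b", "Bob", 0), Person("e", "Esther", 0), Person("h", "Fraudster", 1))),
  ("c", Seq(Person("b", "Bob", 0)))
).toDF("a", "destination")

// Sum the field across the array with aggregate, then divide by the array length.
val result = df.withColumn(
  "mean_var",
  expr("aggregate(destination, 0D, (acc, x) -> acc + x.var_only_0_and_1) / size(destination)")
)
```

Because the expression is built from native functions, it stays visible to the optimizer, which a Scala UDF is not.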

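But explode does not look efficient. Is there a way to operate directly on the array?

Why the downvote?

I know about explode, but I would prefer a solution like this, because it breaks Tungsten's optimizations.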