Get a value from an array inside a map based on a key in Scala

scala apache-spark

I have a DataFrame with the following schema:

 |-- A: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- index: boolean (nullable = false)
 |-- idkey: string (nullable = true)

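Because the values in the map are arrays, I need to extract the index field of the array element whose id matches the "foreign" key column idkey.

For example, I have the following data: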

 {"A":{
 "innerkey_1":[{"id":"1","type":"0.01","index":true},
               {"id":"6","type":"4.3","index":false}]},
 "idkey":"1"}

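Since idkey is 1, we need to output the index value of the element with "id": "1", i.e. index should be true. I really don't know how to accomplish this, whether with a UDF or some other way.

The expected output is:

+--------+
|indexout|
+--------+
|true    |
+--------+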

If your DataFrame has the following schema

root
 |-- A: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- types: string (nullable = true)
 |    |    |    |-- index: boolean (nullable = false)
 |-- idkey: string (nullable = true)

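then you can use two explode functions, one for the map and another for the inner array, apply a filter to keep only the elements whose id matches idkey, and finally select index: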

import org.apache.spark.sql.functions._
df.select(col("idkey"), explode(col("A")))
  .select(col("idkey"), explode(col("value")).as("value"))
  .filter(col("idkey") === col("value.id"))
  .select(col("value.index").as("indexout"))

You should get

+--------+
|indexout|
+--------+
|true    |
+--------+
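
To try the explode approach end to end, here is a minimal sketch that builds the sample row from the question and runs the same pipeline. The `Elem` case class, the `ExplodeExample` object, its `sample` and `indexOut` helpers, and the local SparkSession settings are assumptions made for illustration; they are not part of the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case class mirroring the struct element in the schema above
case class Elem(id: String, types: String, index: Boolean)

object ExplodeExample {
  // Builds a one-row DataFrame with columns A (map of arrays) and idkey
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (Map("innerkey_1" -> Seq(Elem("1", "0.01", true), Elem("6", "4.3", false))), "1")
    ).toDF("A", "idkey")
  }

  // Explode the map, then the inner array, keep the matching element, select index
  def indexOut(df: DataFrame): DataFrame =
    df.select(col("idkey"), explode(col("A")))             // yields idkey, key, value
      .select(col("idkey"), explode(col("value")).as("value"))
      .filter(col("idkey") === col("value.id"))
      .select(col("value.index").as("indexout"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only
    val spark = SparkSession.builder().master("local[*]").appName("explode-example").getOrCreate()
    indexOut(sample(spark)).show(false)
    spark.stop()
  }
}
```

Note that a row with no matching id produces no output row at all with this approach, because the filter drops every exploded element.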

Using a udf function

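You can do the above with a udf function, which avoids the two explode calls and the filter. All of the exploding and filtering is done inside the udf function itself. You can modify it to suit your needs.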

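import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// Search every array in the map for the element whose id equals idkey and return its index
val indexoutUdf = udf((a: Map[String, Seq[Row]], idkey: String) =>
  Option(a).toSeq.flatMap(_.values.flatten)
    .find(row => row.getAs[String]("id") == idkey)
    .map(row => row.getAs[Boolean]("index"))  // None (null in the output) when nothing matches
)
df.select(indexoutUdf(col("A"), col("idkey")).as("indexout")).show(false)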
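
The lookup the udf performs can be exercised outside Spark. The following is a minimal sketch, assuming a hypothetical `Elem` case class in place of Spark's `Row` and a hypothetical `Lookup.indexOut` helper; it returns an `Option` so that a missing id or an empty map yields `None` instead of throwing.

```scala
// Hypothetical plain-Scala stand-in for the struct element (not a Spark Row)
final case class Elem(id: String, types: String, index: Boolean)

object Lookup {
  // Flatten all arrays in the map and return the index flag of the first
  // element whose id equals idkey, or None when nothing matches.
  def indexOut(a: Map[String, Seq[Elem]], idkey: String): Option[Boolean] =
    a.values.flatten.find(_.id == idkey).map(_.index)
}

object LookupDemo extends App {
  val a = Map("innerkey_1" -> Seq(Elem("1", "0.01", true), Elem("6", "4.3", false)))
  println(Lookup.indexOut(a, "1"))  // Some(true)
  println(Lookup.indexOut(a, "6"))  // Some(false)
  println(Lookup.indexOut(a, "7"))  // None
}
```

Returning an `Option` from a Spark udf maps `None` to a null in the result column, which is usually easier to handle downstream than an exception raised inside a task.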

I hope the answer is helpful.

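Can you clarify "i.e. index should be equal to 0"? Can you share your expected output? How can 1 be a boolean? And type in the struct seems to be a double rather than a string?

I have corrected the mistakes, thanks for pointing them out.

The element with index false has id 6, so idkey and id do not match there. The matching index should be true. Aren't the statements "since idkey is 1, we need to output the index value of the element with "id": 1" and "i.e. index should be equal to false" contradictory?

Is there any way to do this other than explode? I have considered it, but it is too expensive for large DataFrames.

@PramodKumar, I have updated the answer :) I hope this time the answer gets upvoted and accepted ;)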
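
On the explode cost raised in the comments: one further option, not mentioned in the thread, is to use Spark SQL's built-in `map_values`, `flatten`, and higher-order `filter` functions through `expr`. This is a sketch assuming Spark 2.4 or later, where these functions are available; the `Elem` case class and the `HigherOrderExample` object with its helpers are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical case class mirroring the struct element in the schema
case class Elem(id: String, types: String, index: Boolean)

object HigherOrderExample {
  // Builds a one-row DataFrame with columns A (map of arrays) and idkey
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (Map("innerkey_1" -> Seq(Elem("1", "0.01", true), Elem("6", "4.3", false))), "1")
    ).toDF("A", "idkey")
  }

  // map_values(A)  -> array of the map's arrays
  // flatten(...)   -> a single array of structs
  // filter(..., x -> x.id = idkey) -> only the structs whose id matches
  // [0].index      -> index field of the first match
  def withIndexOut(df: DataFrame): DataFrame =
    df.select(
      expr("filter(flatten(map_values(A)), x -> x.id = idkey)[0].index").as("indexout"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only
    val spark = SparkSession.builder().master("local[*]").appName("hof-example").getOrCreate()
    withIndexOut(sample(spark)).show(false)
    spark.stop()
  }
}
```

This keeps one output row per input row and stays within built-in expressions, so no rows are multiplied and no udf is needed. How a row with no matching id behaves depends on the Spark version and ANSI settings, so that case is worth checking on your own cluster.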