How to update column values in an array of structs in Spark Scala
I just want to know whether it is possible to update an array of structs to some value, except for the fields I don't want to update. For example, if I have a List[String] = List(zebra, dog), can all the other fields in the array be set to 0, so that elephant and lion become 0? The schema is:
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- Animal: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Elephant: string (nullable = false)
| | |-- Lion: string (nullable = true)
| | |-- Zebra: string (nullable = true)
| | |-- Dog: string (nullable = true)
I'm iterating row by row with a function. Given this input:
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 1, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 1, 1, 0]]|
+---+----+-----+------+-------+--------------------+
After the operation it should be:
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------------+
But I'm unable to make it work. Please check the code below:
def changeValue(row: Row) = {
  // some code
}
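Before reaching for Spark expressions, the per-row logic itself can be sketched in plain Scala. This is a hypothetical sketch (the names `fieldNames`, `keepFields`, and the `Seq`-based `changeValue` are illustrative, standing in for a Spark `Row`); the field order follows the schema above: elephant, lion, zebra, dog.

```scala
// Hypothetical sketch of the row-level logic, using a plain Seq in place of
// a Spark Row. Field order follows the schema: elephant, lion, zebra, dog.
val fieldNames = Seq("elephant", "lion", "zebra", "dog")
val keepFields = Set("zebra", "dog") // fields whose values are preserved

// Zero out every field that is not in the keep-list; leave the rest untouched.
def changeValue(values: Seq[String]): Seq[String] =
  fieldNames.zip(values).map { case (name, v) =>
    if (keepFields.contains(name)) v else "0"
  }

println(changeValue(Seq("1", "1", "0", "1"))) // List(0, 0, 0, 1)
println(changeValue(Seq("0", "1", "1", "0"))) // List(0, 0, 1, 0)
```

Applied to the two example rows, this reproduces the expected [[0, 0, 0, 1]] and [[0, 0, 1, 0]] outputs.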
Construct the expression:
scala> ddf.show(false)
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 11, 111, 1111]]|
|fb1|fb11|fb111|fb1111|fb11111|[[2, 22, 222, 2222]]|
+---+----+-----+------+-------+--------------------+
scala> val columnsTobeUpdatedInWebhooks = Seq("zebra","dog") // Columns to be updated in webhooks.
columnsTobeUpdatedInWebhooks: Seq[String] = List(zebra, dog)
val expr = flatten(
  array(
    ddf
      .select(explode($"webhooks").as("webhooks"))
      .select("webhooks.*")
      .columns
      .map(c =>
        if (columnsTobeUpdatedInWebhooks.contains(c)) col(s"webhooks.${c}").as(c)
        else array(lit(0)).as(c)
      ): _*
  )
)
expr: org.apache.spark.sql.Column = flatten(array(array(0) AS `elephant`, array(0) AS `lion`, webhooks.zebra AS `zebra`, webhooks.dog AS `dog`))
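The printed expression can be reasoned about without a Spark session: for each struct field name, the map keeps a reference to `webhooks.<name>` when the name appears in the update list and substitutes a zero literal otherwise. A plain-Scala model of that per-column choice (the column names are hard-coded here from the schema; in the snippet above they come from the exploded `webhooks` struct):

```scala
// Plain-Scala model of the per-column choice made inside the Spark map above.
// Column names are hard-coded from the webhooks struct in the schema.
val columns = Seq("elephant", "lion", "zebra", "dog")
val columnsTobeUpdatedInWebhooks = Seq("zebra", "dog")

// Mirror of the Spark logic: kept columns reference the original field,
// all other columns become a zero literal.
val pieces = columns.map { c =>
  if (columnsTobeUpdatedInWebhooks.contains(c)) s"webhooks.$c AS `$c`"
  else s"array(0) AS `$c`"
}

println(pieces.mkString(", "))
// array(0) AS `elephant`, array(0) AS `lion`, webhooks.zebra AS `zebra`, webhooks.dog AS `dog`
```

This matches, piece by piece, the `expr` printed by the REPL (minus the outer `flatten(array(...))` wrapper).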
Apply the expression:
Final result:
scala> ddf.withColumn("webhooks",struct(expr)).show(false)
+---+----+-----+------+-------+--------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------+
Doesn't the solution above work?