How to update column values in an array of structs in Spark Scala
I just want to know whether it is possible to update an array of structs to some value, except for the fields I don't want to update. For example, if I have a List[String] = List(zebra, dog), can all the other fields in the array be set to 0, so that elephant and lion become 0? The schema is:
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- Animal: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Elephant: string (nullable = false)
| | |-- Lion: string (nullable = true)
| | |-- Zebra: string (nullable = true)
| | |-- Dog: string (nullable = true)
I'm iterating row by row with a function. Given this input:
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 1, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 1, 1, 0]]|
+---+----+-----+------+-------+--------------------+
After the operation it should be:
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------------+
But I'm unable to make it work. Please check the code below:
def changeValue(row: Row) = {
  // some code
}
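Before reaching for Spark expressions, the per-row logic itself can be sketched in plain Scala. This is a hypothetical sketch (the names `fieldNames`, `keepFields`, and the `Seq`-based `changeValue` are illustrative, standing in for a Spark `Row`); the field order follows the schema above: elephant, lion, zebra, dog.

```scala
// Hypothetical sketch of the row-level logic, using a plain Seq in place of
// a Spark Row. Field order follows the schema: elephant, lion, zebra, dog.
val fieldNames = Seq("elephant", "lion", "zebra", "dog")
val keepFields = Set("zebra", "dog") // fields whose values are preserved

// Zero out every field that is not in the keep-list; leave the rest untouched.
def changeValue(values: Seq[String]): Seq[String] =
  fieldNames.zip(values).map { case (name, v) =>
    if (keepFields.contains(name)) v else "0"
  }

println(changeValue(Seq("1", "1", "0", "1"))) // List(0, 0, 0, 1)
println(changeValue(Seq("0", "1", "1", "0"))) // List(0, 0, 1, 0)
```

Applied to the two example rows, this reproduces the expected [[0, 0, 0, 1]] and [[0, 0, 1, 0]] outputs.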
Construct the expression:
scala> ddf.show(false)
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 11, 111, 1111]]|
|fb1|fb11|fb111|fb1111|fb11111|[[2, 22, 222, 2222]]|
+---+----+-----+------+-------+--------------------+
scala> val columnsTobeUpdatedInWebhooks = Seq("zebra","dog") // Columns to be updated in webhooks.
columnsTobeUpdatedInWebhooks: Seq[String] = List(zebra, dog)
val expr = flatten(
  array(
    ddf
      .select(explode($"webhooks").as("webhooks"))
      .select("webhooks.*")
      .columns
      .map(c =>
        if (columnsTobeUpdatedInWebhooks.contains(c)) col(s"webhooks.${c}").as(c)
        else array(lit(0)).as(c)
      ): _*
  )
)
expr: org.apache.spark.sql.Column = flatten(array(array(0) AS `elephant`, array(0) AS `lion`, webhooks.zebra AS `zebra`, webhooks.dog AS `dog`))
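The printed expression can be reasoned about without a Spark session: for each struct field name, the map keeps a reference to `webhooks.<name>` when the name appears in the update list and substitutes a zero literal otherwise. A plain-Scala model of that per-column choice (the column names are hard-coded here from the schema; in the snippet above they come from the exploded `webhooks` struct):

```scala
// Plain-Scala model of the per-column choice made inside the Spark map above.
// Column names are hard-coded from the webhooks struct in the schema.
val columns = Seq("elephant", "lion", "zebra", "dog")
val columnsTobeUpdatedInWebhooks = Seq("zebra", "dog")

// Mirror of the Spark logic: kept columns reference the original field,
// all other columns become a zero literal.
val pieces = columns.map { c =>
  if (columnsTobeUpdatedInWebhooks.contains(c)) s"webhooks.$c AS `$c`"
  else s"array(0) AS `$c`"
}

println(pieces.mkString(", "))
// array(0) AS `elephant`, array(0) AS `lion`, webhooks.zebra AS `zebra`, webhooks.dog AS `dog`
```

This matches, piece by piece, the `expr` printed by the REPL (minus the outer `flatten(array(...))` wrapper).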
Apply the expression:
Final result:
scala> ddf.withColumn("webhooks",struct(expr)).show(false)
+---+----+-----+------+-------+--------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------+
Doesn't the solution above work?