Updating a value in a struct type column in Java Spark


java apache-spark


I want to be able to update a value in a nested dataset. For this, I created a nested dataset in Spark. It has the following schema:

root
 |-- field_a: string (nullable = false)
 |-- field_b: struct (nullable = true)
 |    |-- field_d: struct (nullable = false)
 |    |    |-- field_not_to_update: string (nullable = true)
 |    |    |-- field_to_update: string (nullable = false)
 |-- field_c: string (nullable = false)

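Now I want to update the value of field_to_update in the dataset. I tried

aFooData.withColumn("field_b.field_d.field_to_update", lit("updated_val"));

and also

aFooData.foreach(new ClassWithForEachFunction());

where ClassWithForEachFunction implements ForeachFunction<Row> and has a method public void call(Row row) that sets field_to_update to the new value. I tried the same thing with a lambda, but it threw a Task not serializable exception, so I had to take the longer route.

Neither approach has worked so far. With foreach I get the same dataset back, and with withColumn I get a new column named field_b.field_d.field_to_update. Are there any other suggestions?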

Please check the steps below:

1. Extract the fields from the struct.
2. Update the required field.
3. Rebuild the struct.
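
The three steps above could look roughly like this in Java. This is a minimal sketch, not the original answer's code: the class name UpdateNestedField, the helpers updateFieldToUpdate and sample, and the sample row values are made up for illustration, and a local SparkSession is assumed. The schema follows the one in the question.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.struct;

import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class UpdateNestedField {

    // Replaces field_b with a rebuilt struct: the untouched leaf is copied,
    // field_to_update is set to newValue, and both struct levels keep their names.
    static Dataset<Row> updateFieldToUpdate(Dataset<Row> df, String newValue) {
        return df.withColumn("field_b",
            struct(
                struct(
                    col("field_b.field_d.field_not_to_update").as("field_not_to_update"),
                    lit(newValue).as("field_to_update")
                ).as("field_d")));
    }

    // Builds a one-row dataset with the schema from the question (values are hypothetical).
    static Dataset<Row> sample(SparkSession spark) {
        StructType fieldD = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("field_not_to_update", DataTypes.StringType, true),
            DataTypes.createStructField("field_to_update", DataTypes.StringType, false)
        });
        StructType fieldB = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("field_d", fieldD, false)
        });
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("field_a", DataTypes.StringType, false),
            DataTypes.createStructField("field_b", fieldB, true),
            DataTypes.createStructField("field_c", DataTypes.StringType, false)
        });
        Row row = RowFactory.create("a", RowFactory.create(RowFactory.create("keep", "old")), "c");
        return spark.createDataFrame(Collections.singletonList(row), schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]").appName("update-nested").getOrCreate();
        Dataset<Row> updated = updateFieldToUpdate(sample(spark), "updated_val");
        updated.printSchema();
        updated.show(false);
        spark.stop();
    }
}
```

One thing to watch: field_b is nullable in the question's schema, and this sketch turns a null field_b into a non-null struct. If that matters, the struct(...) expression can be wrapped in when(col("field_b").isNotNull(), ...). On Spark 3.1 or later, Column.withField may be a shorter alternative; it is not used here so that the sketch also applies to older versions.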

You have to rebuild the whole schema. You can do it in a single statement like this:

import org.apache.spark.sql.functions.{lit, struct}

df.select(
  df("field_a"), // keep the fields that don't change
  struct( // the struct at the first level must be reconstructed
    struct( // and so must the nested struct that holds the field
      df("field_b.field_d.field_not_to_update") as "field_not_to_update", // keep the unchanged sub-elements under their last name
      lit("updated_value") as "field_to_update" // transform or set the new element
    ) as "field_d"
  ) as "field_b", // and we have to keep the name
  df("field_c")
)

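The syntax in Java would be the same.

A more "Java-like" approach is to convert the DataFrame into a (typed) Dataset and then use a map call to change the data. From a Java perspective the code is easy to handle. The downside is that you need three Java bean classes for the given schema.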

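Dataset<Bean1> ds = df.as(Encoders.bean(Bean1.class));
Dataset<Bean1> updatedDs = ds.map((MapFunction<Bean1, Bean1>) row -> {
    // mutate the nested bean in place, then hand the row back to Spark
    row.getField_b().getField_d().setField_to_update("updated");
    return row;
}, Encoders.bean(Bean1.class));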

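using the three bean classes

public static class Bean1 implements Serializable {
    private String field_a;
    private Bean2 field_b;
    private String field_c;
    // getters and setters
}
public static class Bean2 implements Serializable {
    private Bean3 field_d;
    // getters and setters
}
public static class Bean3 implements Serializable {
    private String field_not_to_update;
    private String field_to_update;
    // getters and setters
}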
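
Since Encoders.bean relies on JavaBean-style getters and setters, it may help to see the elided accessors spelled out. The sketch below is an assumption about how they could be written so that the names match the getField_b() / getField_d() / setField_to_update() calls in the map lambda. The enclosing class NestedBeans and the helper withUpdatedField are hypothetical names, and the null guard is an addition motivated by field_b being nullable in the schema.

```java
import java.io.Serializable;

public class NestedBeans {

    public static class Bean3 implements Serializable {
        private String field_not_to_update;
        private String field_to_update;

        public String getField_not_to_update() { return field_not_to_update; }
        public void setField_not_to_update(String v) { this.field_not_to_update = v; }
        public String getField_to_update() { return field_to_update; }
        public void setField_to_update(String v) { this.field_to_update = v; }
    }

    public static class Bean2 implements Serializable {
        private Bean3 field_d;

        public Bean3 getField_d() { return field_d; }
        public void setField_d(Bean3 v) { this.field_d = v; }
    }

    public static class Bean1 implements Serializable {
        private String field_a;
        private Bean2 field_b;
        private String field_c;

        public String getField_a() { return field_a; }
        public void setField_a(String v) { this.field_a = v; }
        public Bean2 getField_b() { return field_b; }
        public void setField_b(Bean2 v) { this.field_b = v; }
        public String getField_c() { return field_c; }
        public void setField_c(String v) { this.field_c = v; }
    }

    // Same update as the map lambda, but skips rows whose field_b or field_d is null.
    static Bean1 withUpdatedField(Bean1 row, String newValue) {
        if (row.getField_b() != null && row.getField_b().getField_d() != null) {
            row.getField_b().getField_d().setField_to_update(newValue);
        }
        return row;
    }
}
```

With these classes, the body of the map call could be reduced to row -> NestedBeans.withUpdatedField(row, "updated"), which keeps the null handling in one place.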

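Thank you, but I need to do this in Java.

If you can switch to Scala, you can use spark-optics to simplify the modification.

@Yashu, the syntax in Java and Scala is almost the same. Let me know if it doesn't work :)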
