Apache Spark: Updating a column in a Dataset with Apache Spark in Scala


apache-spark


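I have the Titanic data stored in a Dataset. I want to create a new Dataset from the existing one in which the sex column is changed to "Child" if the person's age is less than 16, as shown below: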

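// Returns "Child" for ages under 16, otherwise returns the age string unchanged.
// Note: this throws on a null or non-numeric age.
def isChild(age: String): String = {
  if (age.toDouble < 16) "Child" else age
}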

Any help is appreciated. I want to modify column 4 of the Dataset based on its age column, and also handle NULL values.
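For reference, the NULL handling asked about here can be sketched in plain Scala before bringing Spark into it. This is only an illustration under stated assumptions: the object name IsChildSketch and the two-argument labelSex helper are hypothetical, and returning the original sex value for missing or unparseable ages is one possible choice, not something the question specifies.

```scala
import scala.util.Try

object IsChildSketch {
  // Hypothetical helper: returns "Child" when the age parses to a number below 16.
  // A null, empty, or non-numeric age leaves the original sex value unchanged.
  def labelSex(sex: String, age: String): String =
    Option(age)                                  // null becomes None
      .flatMap(a => Try(a.trim.toDouble).toOption) // unparseable text becomes None
      .filter(_ < 16)
      .map(_ => "Child")
      .getOrElse(sex)
}
```

Wrapping the raw value in Option and Try keeps the function total, so it never throws on bad input the way age.toDouble does.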

Based on my understanding of the question, the following should help:

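import org.apache.spark.sql.functions._
// Drop rows with a null in any column, then set "sex" to "Child" when age < 16, else to the age value (as in isChild).
titanic_df.na.drop().withColumn("sex", when(col("age") < 16, lit("Child")).otherwise(col("age"))).show()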

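A sample of titanic_df and the expected output would help a lot, so please update your question :)

It's clearer now. Please see my updated answer below. If it helps, please upvote and accept.

That was helpful. Could you suggest how to handle nulls in col("sex") for the query above? Thanks a lot!!

You want to filter for the non-null rows, right? Is the sex column a StringType?

Actually, if I run titanic_df.withColumn("sex", when(col("age") < 16, lit("Child")).otherwise(col("age"))).show(), it does not handle the nulls in the age column. I'm fine with that; the issue above is solved. FYI, yes, the sex column is a StringType.

Your updated question makes that clear. Please see my latest answer :) It should be what you are looking for. Please don't forget to upvote and accept.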


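// Updated answer: fill null ages with a placeholder of 2 (so those rows are labelled "Child"), then keep the original sex for everyone else.
titanic_df.withColumn("age", when(col("age").isNull, lit(2)).otherwise(col("age")))
  .withColumn("sex", when(col("age") < 16, lit("Child")).otherwise(col("sex")))
  .show()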
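As an alternative sketch (not part of the original answer): when treats a null condition as not matched, so rows with a null age fall through to otherwise and keep their existing sex value. That means no placeholder age is needed if null-age rows should simply stay as they are. The object name ChildLabelSketch, the labelChildren helper, and the sample rows are hypothetical stand-ins for titanic_df, and the cast to double assumes age may be stored as a string.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object ChildLabelSketch {
  // Hypothetical helper: relabels "sex" as "Child" where age < 16.
  // A null age makes the comparison null, which `when` does not match,
  // so those rows take the `otherwise` branch and keep their sex value.
  def labelChildren(df: DataFrame): DataFrame =
    df.withColumn(
      "sex",
      when(col("age").cast("double") < 16, lit("Child")).otherwise(col("sex"))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("child-label-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows (assumption), including one with a missing age.
    val sample = Seq[(String, String, Option[Double])](
      ("Alice", "female", Some(10.0)),
      ("Bob", "male", Some(40.0)),
      ("Carol", "female", None)
    ).toDF("name", "sex", "age")

    labelChildren(sample).show()
    spark.stop()
  }
}
```

Unlike na.drop(), this keeps every row, and unlike filling age with 2, it does not classify people of unknown age as children. Which behaviour is right depends on what the new Dataset is used for.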