Scala: validate columns and write error messages into another column
I get an error when I do this:
val input = spark.read.option("header", "true").option("delimiter", "\t").schema(trFile).csv(fileNameWithPath)
val newSchema = trFile.add("ERROR_COMMENTS", StringType, true)
// Call your custom validation function
val validateDS = dataSetMap.map { row => validateColumns(row) } // <== error here
// Reconstruct the DataFrame with additional columns
val checkedDf = spark.createDataFrame(validateDS, newSchema)
def validateColumns(row: Row): Row = {
var err_val: String = null
val effective_date = row.getAs[String]("date")
.................
Row.merge(row, Row(err_val))
}
Here is my schema:
val FileSchema = StructType(
  Array(
    StructField("date", StringType),
    StructField("count", StringType),
    StructField("name", StringType)
  ))
I'm new to Spark. Let me know what the problem is here, and whether there is a better way to achieve this. I'm using Spark 2.3.

In this case, using a UDF would be easier; then you don't have to worry about changes to the schema, using row.getAs, and so on.
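As background on the error itself: `Dataset.map` needs an implicit `Encoder` for the result type, and none is in scope for a generic `Row`, so the `map` call in the question fails to compile. A minimal sketch of supplying one explicitly (assuming Spark 2.3's `RowEncoder`, and assuming `dataSetMap` is the DataFrame read in as `input` above):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Pass an explicit Encoder carrying the widened schema so map can compile;
// the result is already a Dataset[Row] with the ERROR_COMMENTS column
val validateDS = input.map(row => validateColumns(row))(RowEncoder(newSchema))
```

That said, the UDF route avoids the encoder question entirely.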
First, convert the method into a UDF function:
import org.apache.spark.sql.functions.udf

val validateColumns = udf { (date: String, count: String, name: String) =>
  var err_val: String = null
  // error logic using the 3 column strings
  err_val
}
To add the new column to the DataFrame, use withColumn():
val checkedDf = input.withColumn("ERROR_COMMENTS", validateColumns($"date", $"count", $"name")) // $-syntax needs import spark.implicits._
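The body of the UDF is the asker's own validation logic. Purely as an illustration, here is a hypothetical version of that logic as a plain Scala function (the specific rules for date, count, and name are assumptions, not from the original post). It returns null when the row is valid, otherwise a semicolon-separated error list, matching the ERROR_COMMENTS convention above:

```scala
import scala.util.Try

// Hypothetical validation rules; replace with the real business logic.
// null means the row is valid; otherwise a "; "-joined list of problems.
def validate(date: String, count: String, name: String): String = {
  val errs = scala.collection.mutable.ListBuffer.empty[String]
  if (date == null || !date.matches("""\d{4}-\d{2}-\d{2}"""))
    errs += "invalid date"
  if (count == null || Try(count.toInt).isFailure)
    errs += "invalid count"
  if (name == null || name.trim.isEmpty)
    errs += "missing name"
  if (errs.isEmpty) null else errs.mkString("; ")
}
```

Keeping the rules in a plain function makes them unit-testable without a SparkSession; the UDF then just delegates with `udf(validate _)`.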