Apache Spark: an elegant way to apply a UDF by condition


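I have several input files that all share the same schema. Each has a field named channel_id, but for file1 channel_id = 1 and for file2 channel_id = 2.

I need to run some ETL on these files, but the logic differs from file to file. For example, there is a UDF that computes channel_name: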

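import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val getChannelNameUdf: UserDefinedFunction = udf((channelId: Integer) => {
  if (channelId == 1) {
    "English"
  } else if (channelId == 2) {
    "French"
  } else {
    ""
  }
})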

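Since we have many channels, the if-else chain does not seem elegant. Is there a more elegant way, or a suitable design pattern, to write this code? Many thanks.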

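Hello Brooklyn, welcome to StackOverflow.

You can use pattern matching inside the UDF, but I would recommend using the built-in when function instead of defining your own UDF.

To answer your request directly, you may want the following code: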

val getChannelNameUdf = udf[String, Int] { _ match {
  case 1 => "English"
  case 2 => "French"
  case _ => ""
}}

Or better still, just a pattern-matching anonymous function:

val getChannelNameUdf = udf[String, Int] {
  case 1 => "English"
  case 2 => "French"
  case _ => ""
}

Here is an example using the built-in when function:

val getChannelName = {col: Column =>
  when(col === 1, "English").when(col === 2, "French").otherwise("")
}
df.withColumn("channelName", getChannelName($"channelId"))

Edit: for a more generic approach, you can use the following definition:

val rules = Map(1 -> "English", 2 -> "French")
val getChannelName = {col: Column =>
  rules.foldLeft(lit("")){case (c, (i,label)) =>
    when(col === i, label).otherwise(c)
  }
}

Then:

df.withColumn("channelName", getChannelName($"channelId"))
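
If a UDF is still preferred, the same kind of rules map can back it. The sketch below keeps the lookup as a plain Scala function so it can be checked without a Spark session. The names channelNames and channelName are illustrative and not from the original answer, and the udf wrapping is shown only as a comment.

```scala
// Illustrative mapping; add one entry per channel.
val channelNames: Map[Int, String] = Map(1 -> "English", 2 -> "French")

// Plain lookup: returns "" for unknown ids, matching the fallback
// of the if-else version in the question.
def channelName(channelId: Int): String =
  channelNames.getOrElse(channelId, "")

// Assumed Spark wiring (needs org.apache.spark.sql.functions.udf):
//   val getChannelNameUdf = udf(channelName _)
```

Keeping the mapping as data means a new channel is a one-line change to the map rather than another branch in the function.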

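"Is there a more elegant way or a suitable design pattern to write this code?"

Yes! A simple and effective approach is to use a join. You can keep a file containing all channel references, say with the structure channel_id, channel_name, and then join the two DataFrames. Something like this: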

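// Assumes the CSV has a header row: channel_id,channel_name
val df_channels = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("/path/to/channel_file.csv")

val result = df.join(df_channels, Seq("channel_id"), "left")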
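
A left join leaves channel_name as null for any channel_id missing from the reference file, whereas the UDF in the question returns an empty string. Below is a minimal sketch of reconciling the two, assuming a local SparkSession and small in-memory DataFrames standing in for the real inputs. The data and the names events and channels are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Local session for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("channel-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the input files and the channel reference file.
val events = Seq((1, "a"), (2, "b"), (3, "c")).toDF("channel_id", "payload")
val channels = Seq((1, "English"), (2, "French")).toDF("channel_id", "channel_name")

// broadcast() hints that the small lookup table should be copied to every executor.
val joined = events.join(broadcast(channels), Seq("channel_id"), "left")

// Replace the nulls produced by unmatched ids with "" to mirror the UDF's fallback.
val result = joined.na.fill("", Seq("channel_name"))
```

The broadcast hint is optional. It is included here on the assumption that the channel table is small compared with the event data.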