Apache Spark: elegant syntax for using toEpochDay with DataFrames in Spark Scala

Tags: apache-spark, apache-spark-sql

Here is how clean and simple the epoch-day derivation is with an RDD:
val rdd2 = rdd.map(x => (x._1, x._2, x._3,
LocalDate.parse(x._2.toString).toEpochDay, LocalDate.parse(x._3.toString).toEpochDay))
The RDD fields are all of String type. This gives the expected result, for example:
...(Mike,2018-09-25,2018-09-30,17799,17804), ...
Trying the same thing on a DataFrame with a string column is proving too tricky for me, and I would like to see something elegant if possible. Things like the following, and variations on it, do not work:
val df2 = df.withColumn("s", $"start".LocalDate.parse.toString.toEpochDay)
I get an error. I understand the error, but what is the elegant way to do the conversion? The type of $"start" is ColumnName, not String.
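For context, the conversion the question is after works fine on a plain String via java.time; the problem is only that a DataFrame column is not a String. A minimal sketch of the per-value computation (plain Scala, no Spark needed):

```scala
import java.time.LocalDate

// Works on a String value -- this is what must be lifted to operate on a Column
val d = LocalDate.parse("2018-09-25")
println(d.toEpochDay)  // 17799, matching the RDD output above
```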
You need to define a UDF. For example:
scala> import java.time._
import java.time._
scala> def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
toEpochDay: (s: String)Long
scala> val toEpochDayUdf = udf(toEpochDay(_: String))
toEpochDayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> val df = List("2018-10-28").toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.withColumn("s", toEpochDayUdf($"value")).collect
res0: Array[org.apache.spark.sql.Row] = Array([2018-10-28,17832])
Alternatively, you can define to_epoch_day as a datediff from the epoch:
import org.apache.spark.sql.functions.{datediff, lit, to_date}
import org.apache.spark.sql.Column

def to_epoch_day(c: Column) = datediff(c, to_date(lit("1970-01-01")))
and apply it directly to the column:

df.withColumn("s", to_epoch_day(to_date($"start")))
As long as the string format conforms to ISO 8601, you can even skip the to_date conversion (it will be done implicitly by datediff):

df.withColumn("s", to_epoch_day($"start"))
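As a sanity check outside Spark, the datediff-from-epoch approach computes the same value as LocalDate.toEpochDay; a minimal java.time sketch (plain Scala, no Spark needed):

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

val epoch = LocalDate.parse("1970-01-01")
val d     = LocalDate.parse("2018-10-28")

// Days between the epoch and the date -- the same quantity datediff computes in Spark
val viaDiff = ChronoUnit.DAYS.between(epoch, d)
println(viaDiff)       // 17832, matching the UDF output above
println(d.toEpochDay)  // 17832, identical by definition of epoch day
assert(viaDiff == d.toEpochDay)
```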