
Apache Spark: syntax for using toEpochDate with a DataFrame in Spark Scala, elegantly


Here is how nice and simple deriving the epoch day is with an RDD:

import java.time.LocalDate

val rdd2 = rdd.map(x => (x._1, x._2, x._3,
                         LocalDate.parse(x._2.toString).toEpochDay,
                         LocalDate.parse(x._3.toString).toEpochDay))
The RDD fields are all Strings. That gets the expected results, e.g. rows like:

...(Mike,2018-09-25,2018-09-30,17799,17804), ...
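For reference, a minimal self-contained sketch of the RDD version (the construction of rdd and the sample row are assumptions, since the question does not show them):

import java.time.LocalDate
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("epochDay").getOrCreate()

// Hypothetical sample data with the same shape: (name, start, end) as Strings.
val rdd = spark.sparkContext.parallelize(Seq(("Mike", "2018-09-25", "2018-09-30")))

// Parse each ISO 8601 date string and append its epoch-day value.
val rdd2 = rdd.map(x => (x._1, x._2, x._3,
                         LocalDate.parse(x._2).toEpochDay,
                         LocalDate.parse(x._3).toEpochDay))

rdd2.collect.foreach(println)  // (Mike,2018-09-25,2018-09-30,17799,17804)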
Trying the same with a String column in a DF proved too tricky for me, and I would like to see something elegant if at all possible. Something like the following, and variations on it, does not work:

val df2 = df.withColumn("s", $"start".LocalDate.parse.toString.toEpochDay) 
Getting an error. I understand the error, but what is the elegant way to do the conversion?

$"start" is of type ColumnName, not String.

You need to define a UDF. Example below:

scala> import java.time._
import java.time._

scala> def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
toEpochDay: (s: String)Long

scala> val toEpochDayUdf = udf(toEpochDay(_: String))
toEpochDayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))

scala> val df = List("2018-10-28").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.withColumn("s", toEpochDayUdf($"value")).collect
res0: Array[org.apache.spark.sql.Row] = Array([2018-10-28,17832])
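Applied back to the question's DataFrame, it would look something like this (the column names "start" and "end" are assumptions, since the question only shows $"start"):

// Hypothetical: add an epoch-day column for each date-string column.
val df2 = df.withColumn("startEpoch", toEpochDayUdf($"start"))
            .withColumn("endEpoch", toEpochDayUdf($"end"))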

You can define to_epoch_day as datediff since the beginning of the epoch:

import org.apache.spark.sql.functions.{datediff, lit, to_date}
import org.apache.spark.sql.Column

def to_epoch_day(c: Column) = datediff(c, to_date(lit("1970-01-01")))
and apply it directly:

df.withColumn("s", to_epoch_day(to_date($"start")))
As long as the string format conforms to ISO 8601, you can even skip the conversion to date (it will be done implicitly by datediff):

df.withColumn("s", to_epoch_day($"start"))
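A minimal end-to-end sketch of this approach (the sample data and the "start" column name are assumptions):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{datediff, lit, to_date}
import spark.implicits._  // assumes a SparkSession named spark (auto-provided in spark-shell)

// Epoch day = number of days between the value and 1970-01-01.
def to_epoch_day(c: Column) = datediff(c, to_date(lit("1970-01-01")))

// Hypothetical one-column frame holding an ISO 8601 date string.
val df = List("2018-09-25").toDF("start")

df.withColumn("s", to_epoch_day($"start")).show()
// +----------+-----+
// |     start|    s|
// +----------+-----+
// |2018-09-25|17799|
// +----------+-----+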