
Scala: converting a string to a date in Spark SQL


I am new to Scala. I have a dataframe and I am trying to convert one of its columns from a string to a date. There are two cases, like this:

1)    yyyyMMddHHmmss(20150610120256) ->yyyy-MM-dd HH:mm:ss(2015-06-10 12:02:56) 
2)    yyyyMMddHHmmss(20150611      ) ->yyyy-MM-dd(2015-06-11)
I can handle the first case successfully, but the second one is a problem, and because of it I cannot convert the value to a date. More details below; any help would be greatly appreciated.

df.printSchema
root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: long (nullable = true)
 |-- IN_DATE: string (nullable = true)


df.show
Input  
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE        |
+-----+-------+---------+---------+-------------------+-----------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 20150610120256  |
|   F | 000544|  2017002|      LD |  95032015062763261| 20150611        |
|   F | 000544|  2017002|      AK |  95037854336743246| 20150611012356  |
+-----+-------+---------+---------+-------------------+-----------------+

df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
        to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
        .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))

Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| null                 |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+---------+-------------------+----------------------+
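The null in the second row comes from the first branch: `from_unixtime` expects epoch seconds, not a "yyyyMMdd" string, so its output cannot be cast to a date. The length-based dispatch itself is sound, which is easy to check outside Spark with plain java.time (the helper `parseInDate` below is just an illustration, not part of the question):

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

// Illustrative helper (the name parseInDate is made up): dispatch on the
// trimmed length, exactly like the when(length(...) === 8, ...) above.
def parseInDate(raw: String): String = {
  val s = raw.trim // the 8-digit values carry trailing spaces
  if (s.length == 8)
    LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyyMMdd")).toString
  else
    LocalDateTime.parse(s, DateTimeFormatter.ofPattern("yyyyMMddHHmmss"))
      .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
}

println(parseInDate("20150610120256")) // 2015-06-10 12:02:56
println(parseInDate("20150611      ")) // 2015-06-11
```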

df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
        to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
        .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))

Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11 00:00:00  |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+---------+-------------------+----------------------+


Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11           |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+---------+-------------------+----------------------+
Try this query:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}

df.withColumn("IN_DATE", when(length(regexp_replace(df("IN_DATE"), "\\s+", "")) === 8,
    to_date(from_unixtime(unix_timestamp(regexp_replace(df("IN_DATE"), "\\s+", ""), "yyyyMMdd"))).cast(DateType))
  .otherwise(unix_timestamp(df("IN_DATE"), "yyyyMMddHHmmss").cast(TimestampType)))

There are a couple of options for implementing the date parser:

  • Use the built-in Spark SQL function to_date(). That is the approach used here.
  • Create a user-defined function that performs different date parsing depending on the input format and returns a string. Read more about UDFs.

    2015-06-11 is of type spark.sql.types.DateType, while 2015-06-10 12:02:56 is spark.sql.types.TimestampType.

    You cannot have two data types in the same column; each column of a schema should have exactly one data type.

    I would suggest creating two new columns, each holding the format you want:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{DateType, TimestampType}
    df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
      .withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast(TimestampType))

    This gives you the dataframe as:

    +----+------+-------+-------+-----------------+--------------+----------------+---------------------+
    |TYPE|CODE  |SQ_CODE|RE_TYPE|VERY_ID          |IN_DATE       |IN_DATE_DateOnly|IN_DATE_DateAndTime  |
    +----+------+-------+-------+-----------------+--------------+----------------+---------------------+
    |F   |000544|2017002|OP     |95032015062763298|20150610120256|null            |2015-06-10 12:02:56.0|
    |F   |000544|2017002|LD     |95032015062763261|20150611      |2015-06-11      |null                 |
    |F   |000544|2017002|AK     |95037854336743246|20150611012356|null            |2015-06-11 01:23:56.0|
    +----+------+-------+-------+-----------------+--------------+----------------+---------------------+
    
    You can see that the data types are different:

    root
     |-- TYPE: string (nullable = true)
     |-- CODE: string (nullable = true)
     |-- SQ_CODE: string (nullable = true)
     |-- RE_TYPE: string (nullable = true)
     |-- VERY_ID: string (nullable = true)
     |-- IN_DATE: string (nullable = true)
     |-- IN_DATE_DateOnly: date (nullable = true)
     |-- IN_DATE_DateAndTime: timestamp (nullable = true)
    
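If the goal really is the mixed-format rendering from the expected output, the only way to keep it in a single column is to make that column a string; a minimal sketch, assuming Spark in local mode (the column name IN_DATE_DISPLAY and the two-row sample are mine):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("display").getOrCreate()
import spark.implicits._

val df = Seq("20150610120256", "20150611      ").toDF("IN_DATE")

// A real date/timestamp column cannot mix the two formats, so format
// each branch explicitly into a string column instead.
val out = df.withColumn("IN_DATE_DISPLAY",
  when(length(trim($"IN_DATE")) === 8,
    from_unixtime(unix_timestamp(trim($"IN_DATE"), "yyyyMMdd"), "yyyy-MM-dd"))
  .otherwise(
    from_unixtime(unix_timestamp($"IN_DATE", "yyyyMMddHHmmss"), "yyyy-MM-dd HH:mm:ss")))

out.show(false)
```

The price is that IN_DATE_DISPLAY is only text: it sorts lexicographically and supports no date arithmetic.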
    I hope the answer helps.

    I would:

    • pick the more precise data type, here TimestampType
    • apply the different formats only when presenting the data

    That is not possible, because the two dates would have different data types: one is TimestampType and the other is DateType, and the same column cannot have two schemas. We do have trim in Spark SQL.
    import org.apache.spark.sql.functions._
    import spark.implicits._
    
    val df = Seq("20150610120256", "20150611").toDF("IN_DATE")
    
    df.withColumn("IN_DATE", coalesce(
      to_timestamp($"IN_DATE", "yyyyMMddHHmmss"), 
      to_timestamp($"IN_DATE", "yyyyMMdd"))).show
    
    
    +-------------------+
    |            IN_DATE|
    +-------------------+
    |2015-06-10 12:02:56|
    |2015-06-11 00:00:00|
    +-------------------+
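One caveat before applying the coalesce approach to the original data: there the 8-digit values carry trailing spaces ("20150611" followed by blanks), and depending on the Spark version those spaces can make both patterns fail, so it is safer to trim first. A sketch, self-contained down to the SparkSession (to_timestamp requires Spark 2.2+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("parse-IN_DATE").getOrCreate()
import spark.implicits._

// Same sample as above, but with the trailing spaces the real data has.
val df = Seq("20150610120256", "20150611      ").toDF("IN_DATE")

// Try the longer pattern first; to_timestamp yields null on failure,
// so coalesce falls through to the date-only pattern.
val parsed = df.withColumn("IN_DATE", coalesce(
  to_timestamp(trim($"IN_DATE"), "yyyyMMddHHmmss"),
  to_timestamp(trim($"IN_DATE"), "yyyyMMdd")))

parsed.show(false)
```

The date-only rows still come back as midnight timestamps; that is unavoidable as long as everything stays in one TimestampType column.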