Parsing different timestamp formats in Spark (.NET)

.net apache-spark

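I have a CSV file in which some columns are timestamps in the format "dd/MM/yyyy HH:mm:ss", while other columns in the same file use the format "dd-MM-yyyy HH:mm:ss". When reading the CSV file with Spark, I tried the following: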

SparkSession spark = SparkSession
    .Builder()
    .AppName("Spark Project")
    .GetOrCreate();

spark.Read()
    .Option("delimiter", fileconfig.FileLoaderColumnSeparator)
    .Option("header", hasHeader)
    .Option("inferSchema", true)
    .Option("TimeStampFormat", "dd/MM/yyyy HH:mm:ss")
    .Option("TimeStampFormat", "dd-MM-yyyy HH:mm:ss")
    .Option("TreatEmptyValuesAsNulls", true)
    .Option("IgnoreLeadingWhiteSpace", true)
    .Option("IgnoreTrailingWhiteSpace", true)
    .Csv(path);

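But in this case only the columns that match the last TimeStampFormat are read as timestamps; the columns in the first format are read as strings. I also tried .Option("TimeStampFormat", "dd-MM-yyyy HH:mm:ss", "dd/MM/yyyy HH:mm:ss") and .Option("TimeStampFormat", "dd-MM-yyyy HH:mm:ss,dd/MM/yyyy HH:mm:ss"), but neither works. How can I parse both timestamp formats?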

If I omit the TimeStampFormat option, all timestamps are read as strings.

You can't specify two timestamp formats when reading a CSV file. Only the last TimeStampFormat is used, and any value set before it is overwritten.

These are the options I can think of:

1. Use withColumn while reading the CSV file:

import org.apache.spark.sql.functions._

// inferSchema handles the dd-MM-yyyy columns through TimeStampFormat;
// the dd/MM/yyyy column is read as a string and converted explicitly.
spark.read.
    option("delimiter", fileconfig.FileLoaderColumnSeparator).
    option("header", hasHeader).
    option("inferSchema", true).
    option("TimeStampFormat", "dd-MM-yyyy HH:mm:ss").
    option("TreatEmptyValuesAsNulls", true).
    option("IgnoreLeadingWhiteSpace", true).
    option("IgnoreTrailingWhiteSpace", true).
    csv(path).
    withColumn("<dd/MM/yyyy_field_name>", to_timestamp(col("<dd/MM/yyyy_field_name>"), "dd/MM/yyyy HH:mm:ss"))

2. Read the file first, then use the to_timestamp function to convert the string column to a timestamp:

From Spark >= 2.2:

df.withColumn("<dd/MM/yyyy_field_name>", to_timestamp(col("<dd/MM/yyyy_field_name>"), "dd/MM/yyyy HH:mm:ss"))

3. Use from_unixtime and unix_timestamp:

df.withColumn("<dd/MM/yyyy_field_name>", from_unixtime(unix_timestamp(col("<dd/MM/yyyy_field_name>"), "dd/MM/yyyy HH:mm:ss"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))

4. Use unix_timestamp and cast to the timestamp type:

df.withColumn("<dd/MM/yyyy_field_name>", unix_timestamp(col("<dd/MM/yyyy_field_name>"), "dd/MM/yyyy HH:mm:ss").cast("timestamp"))
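
When several columns each use their own pattern, the per-column conversion shown above can be driven from a map of column name to format. The following is only a minimal sketch under stated assumptions: the column names `created_at` and `updated_at`, the helper `parseTimestamps`, the local master setting, and the in-memory rows standing in for the CSV file are all hypothetical and not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_timestamp}

object MixedTimestampFormats {
  // Hypothetical column names, each mapped to the pattern its values use.
  val formats: Map[String, String] = Map(
    "created_at" -> "dd/MM/yyyy HH:mm:ss",
    "updated_at" -> "dd-MM-yyyy HH:mm:ss"
  )

  // Replace every listed string column with the timestamp parsed from it.
  def parseTimestamps(df: DataFrame, formats: Map[String, String]): DataFrame =
    formats.foldLeft(df) { case (acc, (name, fmt)) =>
      acc.withColumn(name, to_timestamp(col(name), fmt))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Spark Project")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the CSV contents; both columns start out as strings.
    val raw = Seq(
      ("25/12/2020 13:45:10", "26-12-2020 08:00:00")
    ).toDF("created_at", "updated_at")

    val parsed = parseTimestamps(raw, formats)
    parsed.printSchema() // both columns should now be reported as timestamp

    spark.stop()
  }
}
```

Values that do not match the pattern given for their column typically come back as null, depending on the Spark version and its parser settings, so it is worth checking for nulls after the conversion.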