Scala Spark parses timestamps incorrectly
I have a CSV file in the following format:
574,REF009,3213,16384,3258,111,512,2013-12-07 21:03:12.567+01,2013-12-07 21:03:12.567+01,2013-12-31 23:33:15.821+01,/data/ath/athdisk/ro/user/bas/b6/c0
48,REF010,456,32768,3258,111,2175850,2018-07-10 04:37:06.495+02,2018-07-10 04:37:06.459+02,2018-07-10 04:37:06.648+02,/data/ath/athdisk/ro/mc15/b9/dc/lo.log.tgz.1
1758,REF011,123,32768,3258,111,31691926,2017-04-21 22:29:30.315+02,2017-10-20 05:55:03.959+02,2017-04-21 22:29:31+02,/data/ath/athdisk/ro/dataV/1f/00/D0293.pool.root
I am importing it with the following schema and reader options:
import org.apache.spark.sql.types._

// Explicit schema: the three inode time columns are read as timestamps.
val inodes_schema = StructType(
  Array(
    StructField("testID", LongType, false),
    StructField("ref", StringType, false),
    StructField("iref", IntegerType, false),
    StructField("flag", IntegerType, false),
    StructField("iuid", IntegerType, false),
    StructField("igid", IntegerType, false),
    StructField("isize", LongType, false),
    StructField("icrtime", TimestampType, false),
    StructField("iatime", TimestampType, false),
    StructField("ictime", TimestampType, false),
    StructField("path", StringType, false)
  )
)

// FAILFAST aborts the read on the first malformed record instead of nulling it.
val inodes_table = spark.read.option("mode", "FAILFAST")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSX")
  .schema(inodes_schema)
  .option("delimiter", ",")
  .option("header", false).csv("/my/csv/file.csv")
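When I try to import this large file, which is 11B lines long, about 4M rows end up with null values. I realized something was wrong with my file, so I ran the import with the FAILFAST option (.option("mode", "FAILFAST")).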
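That let me determine that lines containing 59+02 were causing the problem. Because many lines contain 59+02, I went back to the default PERMISSIVE mode and eventually identified one of those rows as not imported correctly.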
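I don't understand why Spark fails to parse this row. The time in my timestamp, 05:55:03.959+02, is in the correct format, yet the row is not imported properly, and many other rows are probably affected as well.

The problem above appears to be caused by the data containing more than one timestamp format. A workaround is to read the TimestampType columns as StringType when loading the CSV, then cast them back to TimestampType: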
// /path/to/csvfile:
1,2017-04-21 22:29:30.315+02
2,2017-10-20 05:55:03.959+02
3,2017-04-21 22:29:31+02
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._  // for the $"col" syntax; already in scope in spark-shell

val schema = StructType(Array(
  StructField("id", IntegerType, false),
  StructField("dt", StringType, false)  // read the timestamp as a plain string first
))

val df = spark.read.
  option("mode", "FAILFAST").
  option("delimiter", ",").
  option("header", false).
  schema(schema).
  csv("/path/to/csvfile")

// Cast the string column to a timestamp after loading.
df.select($"id", $"dt".cast(TimestampType).as("dt")).
  show(false)
// +---+-----------------------+
// |id |dt                     |
// +---+-----------------------+
// |1  |2017-04-21 13:29:30.315|
// |2  |2017-10-19 20:55:03.959|
// |3  |2017-04-21 13:29:31    |
// +---+-----------------------+
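To see why a single fixed pattern cannot cover both shapes of timestamp in the sample, here is a minimal sketch in plain java.time. It is not Spark's own CSV parser, so Spark's behaviour may differ between versions. The object name FlexibleTimestamps and its members are hypothetical, introduced only for this illustration. The strict formatter mirrors the pattern from the question, while the flexible one makes the fractional seconds optional.

```scala
import java.time.{Instant, OffsetDateTime}
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.time.temporal.ChronoField
import scala.util.Try

// Hypothetical helper, not part of Spark: compares a fixed pattern with one
// that tolerates a missing or variable-length fraction of a second.
object FlexibleTimestamps {
  // Same pattern string as the question's timestampFormat option.
  val strict: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSX")

  // Fraction is optional: zero to nine digits, preceded by a decimal point
  // only when at least one digit is present.
  val flexible: DateTimeFormatter = new DateTimeFormatterBuilder()
    .appendPattern("yyyy-MM-dd HH:mm:ss")
    .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
    .appendPattern("X")
    .toFormatter()

  // Returns None instead of throwing when the text does not match.
  def parse(fmt: DateTimeFormatter, text: String): Option[Instant] =
    Try(OffsetDateTime.parse(text, fmt).toInstant).toOption
}
```

With the strict formatter, FlexibleTimestamps.parse returns None for 2017-04-21 22:29:31+02 because the fraction is missing. The flexible formatter accepts that value as well as 2017-10-20 05:55:03.959+02.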
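You have two different formats, so reading the input with a single format doesn't look like the right approach. Wouldn't you want to try that instead?
@user8371915 My file contains millions of records. Timestamps with either 2 or 3 millisecond digits are parsed fine, yet some rows still fail.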
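One way to check how many distinct timestamp layouts a file really contains is to reduce each value to its "shape" and count the shapes. The sketch below is plain Scala with no Spark dependency. The object TimestampShapes and its method names are hypothetical, chosen for this illustration only.

```scala
// Hypothetical helper: summarises which timestamp layouts occur in the data.
object TimestampShapes {
  // Replace every digit with 'd' so that values sharing a layout collapse
  // to the same string, e.g. "22:29:31+02" becomes "dd:dd:dd+dd".
  def shape(value: String): String =
    value.map(c => if (c.isDigit) 'd' else c)

  // Count how many values fall into each layout.
  def shapeCounts(values: Seq[String]): Map[String, Int] =
    values.groupBy(shape).map { case (layout, group) => layout -> group.size }
}
```

Applied to the three dt values from the sample CSV, this gives two layouts: one with a three-digit fraction (two values) and one with no fraction (one value). On a large dataset the same idea could be applied to the string column before casting. That part is an assumption and is not shown here.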