Apache Spark: Reading a CSV into a Spark DataFrame with timestamp and date types

Tags: apache-spark, apache-spark-sql, apache-spark-1.6


This is CDH with Spark 1.6.

I am trying to import this hypothetical CSV into an Apache Spark DataFrame:

$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a
I am using the databricks spark-csv jar:

val textData = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
    .option("inferSchema", "true")
    .option("nullValue", "null")
    .load("test.csv")
I use the inferSchema option to create the schema for the resulting DataFrame. The printSchema() function gives the following output for the code above:

scala> textData.printSchema()
root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: string (nullable = true)
 |-- C4: string (nullable = true)
 |-- C5: timestamp (nullable = true)
 |-- C6: string (nullable = true)

scala> textData.show()
+---+---+---+----------+---+--------------------+---+
| C0| C1| C2|        C3| C4|                  C5| C6|
+---+---+---+----------+---+--------------------+---+
|  a|  b|  c|2016-09-09|  a|2016-11-11 09:09:...|  a|
|  a|  b|  c|2016-09-10|  a|2016-11-11 09:09:...|  a|
+---+---+---+----------+---+--------------------+---+
Column C3 has string type. I want C3 to have date type. To get it to date type I tried the following code:

val textData = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd")
    .option("inferSchema", "true")
    .option("nullValue", "null")
    .load("test.csv")

scala> textData.printSchema
root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: timestamp (nullable = true)
 |-- C4: string (nullable = true)
 |-- C5: timestamp (nullable = true)
 |-- C6: string (nullable = true)

scala> textData.show()
+---+---+---+--------------------+---+--------------------+---+
| C0| C1| C2|                  C3| C4|                  C5| C6|
+---+---+---+--------------------+---+--------------------+---+
|  a|  b|  c|2016-09-09 00:00:...|  a|2016-11-11 00:00:...|  a|
|  a|  b|  c|2016-09-10 00:00:...|  a|2016-11-11 00:00:...|  a|
+---+---+---+--------------------+---+--------------------+---+
The only difference between this code and the first block is the dateFormat option line (I use "yyyy-MM-dd" instead of "yyyy-MM-dd HH:mm:ss"). Now I get both C3 and C5 as timestamps (C3 is still not a date). But for C5, the HH:mm:ss part is ignored and shows up as zeroes in the data.

Ideally I want C3 to have type date and C5 to have type timestamp with its HH:mm:ss part not ignored. My current workaround is this: I generate the CSV by pulling data in parallel from the database, and I make sure to pull all dates as timestamps (not ideal). So the test CSV now looks like this:

$ hadoop fs -cat new-test.csv
a,b,c,2016-09-09 00:00:00,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10 00:00:00,a,2016-11-11 09:09:10.0,a
This is my final working code:

val textData = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
    .schema(finalSchema)
    .option("nullValue", "null")
    .load("new-test.csv")
Here I use the full timestamp format ("yyyy-MM-dd HH:mm:ss") in dateFormat. I manually create the finalSchema instance, in which C3 is date and C5 is timestamp (Spark SQL types), and apply it with the schema() function. The output looks as follows:
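For reference, a schema with those types could be written by hand like this (a minimal sketch; the question does not show the actual finalSchema definition, only its REPL output below):

import org.apache.spark.sql.types._

// Hand-written schema matching the inferred column names, with C3 as a
// date and C5 as a timestamp; all fields nullable.
val finalSchema = StructType(Seq(
  StructField("C0", StringType, nullable = true),
  StructField("C1", StringType, nullable = true),
  StructField("C2", StringType, nullable = true),
  StructField("C3", DateType, nullable = true),
  StructField("C4", StringType, nullable = true),
  StructField("C5", TimestampType, nullable = true),
  StructField("C6", StringType, nullable = true)
))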

scala> finalSchema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(C0,StringType,true), StructField(C1,StringType,true), StructField(C2,StringType,true), StructField(C3,DateType,true), StructField(C4,StringType,true), StructField(C5,TimestampType,true), StructField(C6,StringType,true))

scala> textData.printSchema()
root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: date (nullable = true)
 |-- C4: string (nullable = true)
 |-- C5: timestamp (nullable = true)
 |-- C6: string (nullable = true)


scala> textData.show()
+---+---+---+----------+---+--------------------+---+
| C0| C1| C2|        C3| C4|                  C5| C6|
+---+---+---+----------+---+--------------------+---+
|  a|  b|  c|2016-09-09|  a|2016-11-11 09:09:...|  a|
|  a|  b|  c|2016-09-10|  a|2016-11-11 09:09:...|  a|
+---+---+---+----------+---+--------------------+---+
Is there an easier or out-of-the-box way to parse a CSV file (that has both date and timestamp types) into a Spark DataFrame?

For the less common cases, using the infer option will probably not return the expected result. It only tries to match each column against a timestamp type, never a date type (see the spark-csv type-inference code quoted at the end of this answer), so an out-of-the-box solution is not possible here. The fix is to define the schema directly; this also avoids the infer option picking a type that matches only the sampled RDD rather than the entire data. Your finalSchema is an effective solution.

It is not very elegant, but you can convert from timestamp to date like this (check the last line):

import org.apache.spark.sql.functions.expr

val textData = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd")
    .option("inferSchema", "true")
    .option("nullValue", "null")
    .load("test.csv")
    .withColumn("C4", expr("to_date(C4)"))
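If you prefer the typed functions API over a SQL expression string, the same workaround can be sketched like this (assuming the same test.csv and read options as above; the val name textDataTyped is just for illustration, and to_date and col come from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.{col, to_date}

// Same idea as the expr("to_date(C4)") line above: load with
// inferSchema, then narrow the inferred timestamp column C4 to a date.
val textDataTyped = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd")
    .option("inferSchema", "true")
    .option("nullValue", "null")
    .load("test.csv")
    .withColumn("C4", to_date(col("C4")))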
For reference, this is the relevant type-inference logic from spark-csv; note that there is no DateType case, which is why a date column can never be inferred:

if (field == null || field.isEmpty || field == nullValue) {
  typeSoFar
} else {
  typeSoFar match {
    case NullType => tryParseInteger(field)
    case IntegerType => tryParseInteger(field)
    case LongType => tryParseLong(field)
    case DoubleType => tryParseDouble(field)
    case TimestampType => tryParseTimestamp(field)
    case BooleanType => tryParseBoolean(field)
    case StringType => StringType
    case other: DataType =>
      throw new UnsupportedOperationException(s"Unexpected data type $other")
  }
}