
Python pyspark DataFrame API cast('timestamp') does not work on timestamp strings


I have data like the following:

{"id":1,"createdAt":"2016-07-01T16:37:41-0400"}
{"id":2,"createdAt":"2016-07-01T16:37:41-0700"}
{"id":3,"createdAt":"2016-07-01T16:37:41-0400"}
{"id":4,"createdAt":"2016-07-01T16:37:41-0700"}
{"id":5,"createdAt":"2016-07-06T09:48Z"}
{"id":6,"createdAt":"2016-07-06T09:48Z"}
{"id":7,"createdAt":"2016-07-06T09:48Z"}
I convert the createdAt field to a timestamp like this:

from pyspark.sql import SQLContext
from pyspark.sql.functions import *

sqlContext = SQLContext(sc)
df = sqlContext.read.json('data/test.json')
dfProcessed = df.withColumn('createdAt', df.createdAt.cast('timestamp'))

dfProcessed.printSchema()
dfProcessed.collect()
The output I get is shown below. I am not getting any values for createdAt. How can I retrieve the field as a proper timestamp?

root
 |-- createdAt: timestamp (nullable = true)
 |-- id: long (nullable = true)

[Row(createdAt=None, id=1),
 Row(createdAt=None, id=2),
 Row(createdAt=None, id=3),
 Row(createdAt=None, id=4),
 Row(createdAt=None, id=5),
 Row(createdAt=None, id=6),
 Row(createdAt=None, id=7)]

To cast a string column to a timestamp with a simple cast, the string must be properly formatted.

To retrieve the createdAt column as a timestamp, you can write a UDF that converts a string such as

"2016-07-01T16:37:41-0400"

into

"2016-07-01 16:37:41"

and apply it to the createdAt column (don't forget to handle the timezone offset).

Once you have a column that holds the timestamp as a string like "2016-07-01 16:37:41", a simple cast to timestamp will do the job, just as in your code.
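A minimal sketch of that normalization step (the helper name and the two format strings are my own assumptions based on the sample data; Python 3.7+ is assumed so that %z accepts a literal 'Z'):

```python
from datetime import datetime, timezone

# Hypothetical helper: normalize the mixed ISO-8601 variants in the sample
# data to a plain 'yyyy-MM-dd HH:mm:ss' string (in UTC) that a simple
# cast('timestamp') understands.
def normalize_created_at(s):
    if s is None:
        return None
    for fmt in ('%Y-%m-%dT%H:%M:%S%z',   # e.g. 2016-07-01T16:37:41-0400
                '%Y-%m-%dT%H:%M%z'):     # e.g. 2016-07-06T09:48Z
        try:
            dt = datetime.strptime(s, fmt)
            # Handle the timezone by converting everything to UTC first.
            return dt.astimezone(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
        except ValueError:
            continue
    return None  # string did not match any expected format

# Wrapped as a Spark UDF (assuming pyspark is available):
# from pyspark.sql.functions import udf
# normalize_udf = udf(normalize_created_at)  # returns StringType by default
# dfProcessed = df.withColumn('createdAt',
#                             normalize_udf(df.createdAt).cast('timestamp'))
```

Note that the offsets are folded into UTC here; if you want timestamps in a different zone, adjust the astimezone() call accordingly.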


You can read more about date/time/string handling in Spark.

BTW, what version of Apache Spark are you using? Good to know, thanks. I was hoping to avoid a UDF and the extra processing; something built-in could handle this more efficiently. But it seems there isn't one.