Python: pyspark cannot parse timestamps in JSON

For example, given the following JSON (in a file named "json"):

{"myTime": "2016-10-26 18:19:15"}
and the following Python script:

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('simpleTest')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc.version)
json_file = 'json'  # path to the JSON file shown above
df = sqlContext.read.json(json_file, timestampFormat='yyyy-MM-dd HH:mm:ss')
df.printSchema()
The output is:

2.0.2
root
 |-- myTime: string (nullable = true)
I expected the schema to define myTime as a timestamp.
What am I missing?

You need to define a schema explicitly:

from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("myTime", TimestampType(), True)])

# `spark` is the SparkSession (Spark 2.0+); sqlContext.read.json accepts the same arguments
df = spark.read.json(json_file, schema=schema, timestampFormat="yyyy-MM-dd HH:mm:ss")
This outputs:

>>> df.collect()
[Row(myTime=datetime.datetime(2016, 10, 26, 18, 19, 15))]
>>> df.printSchema()
root
 |-- myTime: timestamp (nullable = true)


In addition to Dat Tran's solution, you can also apply cast to the DataFrame column directly after reading the file:

# example
from pyspark.sql import Row
json = [Row(**{"myTime": "2016-10-26 18:19:15"})]
df = spark.sparkContext.parallelize(json).toDF()

# using cast to 'timestamp' format
df_time = df.select(df['myTime'].cast('timestamp'))
df_time.printSchema()

root
 |-- myTime: timestamp (nullable = true)
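
The select above returns only the cast column. As a minimal sketch of keeping the other columns as well, the same cast can be written with withColumn, which replaces a column of the same name. The helper name cast_to_timestamp, the demo function, and the extra id column are made up for illustration and are not part of the answers here.

```python
def cast_to_timestamp(df, column):
    """Return a new DataFrame in which `column` is cast to timestamp.

    withColumn replaces an existing column of the same name, so every
    other column of `df` is carried over unchanged.
    """
    return df.withColumn(column, df[column].cast("timestamp"))


def demo():
    # Illustrative usage only; requires a working pyspark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("castDemo").getOrCreate()
    # Hypothetical two-column frame: the timestamp string plus an id.
    df = spark.createDataFrame([("2016-10-26 18:19:15", 1)], ["myTime", "id"])
    cast_to_timestamp(df, "myTime").printSchema()
```

Calling demo() in a pyspark environment prints the schema of the converted DataFrame; the original df is left unchanged because DataFrames are immutable.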

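Won't doing this on a large dataset hurt performance, since we perform two DataFrame operations?

It will take some time, but not much, because it only applies the cast function to a column. You can also put the cast column back into the original DataFrame.

So the point of defining "timestampFormat" is to say how timestamp strings will be interpreted when they are loaded into a "TimestampType()" column of the schema?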
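
To make the role of a format pattern concrete, here is a plain-Python analogy that does not use Spark. It assumes a simple token mapping from the Java-style pattern used above to strptime directives, covering only the tokens that appear in this thread. Spark's own parsing is done on the JVM, so this only illustrates the idea of a pattern describing how a string is read; the names JAVA_TO_STRPTIME and to_strptime_format are made up.

```python
from datetime import datetime

# Assumed mapping for the tokens in 'yyyy-MM-dd HH:mm:ss' only;
# this is not a complete translation of Java date patterns.
JAVA_TO_STRPTIME = [
    ("yyyy", "%Y"),
    ("MM", "%m"),
    ("dd", "%d"),
    ("HH", "%H"),
    ("mm", "%M"),
    ("ss", "%S"),
]


def to_strptime_format(java_pattern):
    """Translate the supported Java-style tokens into strptime directives."""
    fmt = java_pattern
    for java_token, py_token in JAVA_TO_STRPTIME:
        fmt = fmt.replace(java_token, py_token)
    return fmt


fmt = to_strptime_format("yyyy-MM-dd HH:mm:ss")
parsed = datetime.strptime("2016-10-26 18:19:15", fmt)
print(fmt)     # %Y-%m-%d %H:%M:%S
print(parsed)  # 2016-10-26 18:19:15

# A string that does not follow the pattern cannot be interpreted.
try:
    datetime.strptime("26/10/2016 18:19", fmt)
except ValueError as exc:
    print("does not match the pattern:", exc)
```

In this analogy the pattern only says how to read the string; the target type (TimestampType in the schema above) is declared separately.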