PySpark successfully writes an invalid date, but raises an exception on read
I am processing some data with PySpark and writing the processed files to S3 in Parquet format. The batch job runs on EC2 inside a Docker container (Linux). The data contains some datetime fields, which I save as TimestampType (in the Parquet file) because I need to query them in Athena. If such a field has the value '0001-01-01', the batch job writes it to the Parquet file without error, but an exception is raised only when the data is read back. This is the behavior on the Linux machine. Here is sample code to reproduce the issue:
from pyspark.sql.types import StructType, StructField, TimestampType
from dateutil.parser import parse

# '0001-01-01' parses to a valid Python datetime (the minimum supported year is 1)
d = parse('0001-01-01 00:00:00')
data = [{'createdon': d}]
distdata = sc.parallelize(data)
schema = StructType([StructField('createdon', TimestampType())])
df = spark.createDataFrame(distdata, schema)
df.write.parquet("/test-1")  # note: a backslash here ("\test-1") would embed a tab character
After executing this code, the data is written to the file without error. When I try to read the same data back, I get the error below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 572, in take
return self.limit(num).collect()
File "/usr/local/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 535, in collect
return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 147, in load_stream
yield self._read_with_length(stream)
File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 580, in loads
return pickle.loads(obj, encoding=encoding)
File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 1396, in <lambda>
return lambda *a: dataType.fromInternal(a)
File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 633, in fromInternal
for f, v, c in zip(self.fields, obj, self._needConversion)]
File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 633, in <listcomp>
for f, v, c in zip(self.fields, obj, self._needConversion)]
File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 445, in fromInternal
return self.dataType.fromInternal(obj)
File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 199, in fromInternal
return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
ValueError: year 0 is out of range
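The last frame of the traceback shows where this fails: PySpark converts the stored microseconds back into a Python datetime via `datetime.datetime.fromtimestamp(...)`, and Python's `datetime` type only supports years 1 through 9999, so a value that lands in year 0 after conversion cannot be represented. A minimal pure-Python illustration of that limit (independent of Spark):

```python
import datetime

# Python's datetime range is [datetime.MINYEAR, datetime.MAXYEAR] = [1, 9999]
print(datetime.MINYEAR)  # → 1

try:
    datetime.datetime(0, 1, 1)  # same failure mode as the traceback above
except ValueError as e:
    print(e)  # → year 0 is out of range
```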
Ideally it should never have been written in the first place, since the value of the createdon (datetime) field is invalid, but that is not the behavior. Am I doing something wrong? Any ideas?
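Since Spark itself does not reject the value at write time, one option is to sanitize the field before building the DataFrame. The sketch below is only an assumption about how one might do this; `MIN_SUPPORTED` is a hypothetical cutoff, not anything from the original post, and should be set to whatever the earliest legitimate date in the data is:

```python
from datetime import datetime

# Hypothetical cutoff: anything earlier is treated as invalid sentinel data
MIN_SUPPORTED = datetime(1678, 1, 1)

def sanitize_timestamp(value):
    """Return None for timestamps that are unrealistically early, so they
    round-trip through Parquet as nulls instead of breaking on read."""
    if value is not None and value < MIN_SUPPORTED:
        return None
    return value
```

Records could then be cleaned with this function (e.g. in a map over the rows) before calling `spark.createDataFrame`, so out-of-range dates become nulls rather than values that only fail later, on read.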