PySpark successfully writes an invalid date but raises an exception when reading it back

I am processing some data with PySpark and writing the processed files to S3 in Parquet format. The batch job runs on EC2 inside a Docker container (Linux). The data also contains some datetime fields, which I save as TimestampType (in the Parquet file) because I need to query them from Athena. If such a field has the value '0001-01-01', the batch job writes it to the Parquet file successfully, and an exception is thrown only when this data is read back. This is the behaviour on the Linux machine.
Here is sample code to reproduce the issue:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType
from dateutil.parser import parse

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# A timestamp with year 1: Parquet accepts it on write, but reading it back fails
d = parse('0001-01-01 00:00:00')
data = [{'createdon': d}]
distdata = sc.parallelize(data)
schema = StructType([StructField('createdon', TimestampType())])
df = spark.createDataFrame(distdata, schema)
df.write.parquet("/test-1")
After executing this code, the data is written to the file without any error. When I try to read the same data back, I get the error below:
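The read code is not shown above; a minimal sketch of the call that triggers the traceback (the path is assumed to match the write example) is:

# Hypothetical read-back of the file written above; the path is an assumption
df2 = spark.read.parquet("/test-1")
df2.take(1)   # fails while converting the timestamp back to a Python datetime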

Traceback (most recent call last):                                                                    
  File "<stdin>", line 1, in <module>                                                                 
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 572, in take           
    return self.limit(num).collect()                                                                  
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 535, in collect        
    return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))                  
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 147, in load_stream      
    yield self._read_with_length(stream)                                                              
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)                                                                            
  File "/usr/local/lib/python3.6/site-packages/pyspark/serializers.py", line 580, in loads            
    return pickle.loads(obj, encoding=encoding)                                                       
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 1396, in <lambda>          
    return lambda *a: dataType.fromInternal(a)                                                        
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 633, in fromInternal       
    for f, v, c in zip(self.fields, obj, self._needConversion)]                                       
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 633, in <listcomp>         
    for f, v, c in zip(self.fields, obj, self._needConversion)]                                       
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 445, in fromInternal       
    return self.dataType.fromInternal(obj)                                                            
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/types.py", line 199, in fromInternal       
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)           
ValueError: year 0 is out of range  

Ideally it should never have been written in the first place, since the value of the createdon (datetime) field is invalid, but that is not the behaviour I see. Am I doing something wrong?
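One workaround I am considering is to validate the timestamp column before writing and null out values below a minimum bound, so the read side never has to convert a year-0/1 timestamp back to a Python datetime. A rough sketch of that idea (the 1900 cutoff is an arbitrary assumption on my part, not something required by Spark or Parquet):

from pyspark.sql import functions as F

# Hypothetical pre-write guard: replace timestamps earlier than an assumed cutoff with null
MIN_TS = '1900-01-01 00:00:00'
cleaned = df.withColumn(
    'createdon',
    F.when(F.col('createdon') >= F.lit(MIN_TS).cast('timestamp'), F.col('createdon'))
)
cleaned.write.mode('overwrite').parquet("/test-1")

But this feels like it only hides the underlying issue.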

Any thoughts?