Error reading parquet files with pyspark: "Required field 'version' was not found in serialized data"
I am using pyspark 2.0 on a Hadoop cluster with Hortonworks HDP 2.5. I try to read parquet files with:
dfsms = spark.read.parquet("/projects/data/parquetfolder")
I can see the header of the data and print a few rows. But when I try the following:
dfsms.count()
dfsms.describe().show()
I get the following error:
java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'version' was not found in serialized data! Struct: FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)
--------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-7-0717f4acccb0> in <module>()
1 #dfsms.show(10)
2
----> 3 dfsms.count()
/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/dataframe.py in count(self)
297 2
298 """
--> 299 return int(self._jdf.count())
300
301 @ignore_unicode_prefix
/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(
Py4JJavaError: An error occurred while calling o46.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 42, worker12.): java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'version' was not found in serialized data! Struct: FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)
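The error only appears on `count()` and `describe()` because `spark.read.parquet` is lazy: showing a few rows touches only one file, while a full scan reads the footer of every file in the folder, so a single corrupt or truncated file (for example, one left behind by an interrupted write) can trigger this failure. One way to narrow it down, as a sketch rather than anything Spark provides: every valid parquet file begins and ends with the 4-byte magic `PAR1`, so a quick byte check on a local copy of the folder can flag truncated files. The folder name below is illustrative, not from the original post.

```python
import os

def check_parquet_magic(path):
    """Return True if the file starts and ends with the parquet magic bytes b'PAR1'."""
    size = os.path.getsize(path)
    if size < 12:  # magic + 4-byte footer length + magic is the absolute minimum
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"

# Scan a local copy of the folder and report suspect files.
folder = "parquetfolder_local_copy"  # hypothetical local path
if os.path.isdir(folder):
    for name in sorted(os.listdir(folder)):
        if name.endswith(".parquet"):
            path = os.path.join(folder, name)
            if not check_parquet_magic(path):
                print("corrupt or truncated:", path)
```

For files on HDFS you would first copy them locally (e.g. `hdfs dfs -get`) or read the bytes through an HDFS client. Any file that fails the check can be moved aside and the read retried.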