Error reading parquet file using pyspark: "Required field 'version' was not found in serialized data"


I am using pyspark 2.0 with Hortonworks HDP 2.5 on a Hadoop cluster. I try to read a parquet file with:

dfsms = spark.read.parquet("/projects/data/parquetfolder")
I can see the header of the data and print a few rows. But when I try the following:

dfsms.count()
dfsms.describe().show()
I get the following error:

java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'version' was not found in serialized data! Struct: FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)


    --------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-7-0717f4acccb0> in <module>()
      1 #dfsms.show(10)
      2 
----> 3 dfsms.count()

/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/dataframe.py in count(self)
    297         2
    298         """
--> 299         return int(self._jdf.count())
    300 
    301     @ignore_unicode_prefix

/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    931         answer = self.gateway_client.send_command(command)
    932         return_value = get_return_value(
--> 933             answer, self.gateway_client, self.target_id, self.name)
    934 
    935         for temp_arg in temp_args:

/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    310                 raise Py4JJavaError(
    311                     "An error occurred while calling {0}{1}{2}.\n".
--> 312                     format(target_id, ".", name), value)
    313             else:
    314                 raise Py4JError(

Py4JJavaError: An error occurred while calling o46.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 42, worker12.): java.io.IOException: can not read class org.apache.parquet.format.FileMetaData: Required field 'version' was not found in serialized data! Struct: FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)
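
This error typically means Spark could not decode the Thrift-encoded footer of one of the part files in the folder, often because a file was truncated or only partially written. That would also explain why printing a few rows works while `count()` fails: the preview touches only the first file(s), whereas `count()` has to open the footer of every part file. One way to narrow down the culprit is to check each file for the 4-byte magic `PAR1`, which a valid parquet file must carry at both its start and its end. A minimal sketch, assuming the folder has been copied locally first (e.g. with `hdfs dfs -get`); `check_parquet_magic` and `find_bad_files` are hypothetical helper names, not part of any library:

```python
import os

def check_parquet_magic(path):
    """Return True if the file starts and ends with the 4-byte b'PAR1' magic.

    A truncated or partially written part file usually lacks the trailing
    magic, so its footer (which holds FileMetaData, including 'version')
    cannot be decoded.
    """
    with open(path, "rb") as f:
        if f.read(4) != b"PAR1":
            return False
        f.seek(-4, os.SEEK_END)
        return f.read(4) == b"PAR1"

def find_bad_files(folder):
    """Scan every part file in a local parquet folder and list suspects."""
    bad = []
    for name in sorted(os.listdir(folder)):
        # Skip non-data entries such as _SUCCESS or hidden checksum files.
        if name.startswith("_") or name.startswith("."):
            continue
        path = os.path.join(folder, name)
        if not check_parquet_magic(path):
            bad.append(path)
    return bad
```

Any file this flags can then be re-generated or excluded; alternatively, in Spark 2.x, setting `spark.sql.files.ignoreCorruptFiles=true` makes Spark skip unreadable files instead of failing the whole job, at the cost of silently dropping their rows.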