
Apache Spark / PySpark: Expected: decimal(16,2), Found: BINARY


I am getting the following error when trying to view the data in a DataFrame created from a Parquet file:

Expected: decimal(16,2), Found: BINARY
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:221)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:291)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:283)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:497)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:220)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:174)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:215)
I am using Spark 2.4.4. I searched around but could not find enough information on this.

20/04/30 21:52:18 ERROR TaskSetManager: Task 0 in stage 9.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 380, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o329.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 27, ip-10-157-181-75.extnp.national.com.au, executor 50): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file 
Edit: We found that this is a limitation in Spark. A column with data type Decimal(38,2) works fine.

Thanks,
Mc

This is expected behavior if you provide an external schema in which the column's data type is defined as decimal while the column actually contains binary values.


What you can do is read all the columns as StringType first, and then enforce the schema (cast the columns) once you have inspected the DataFrame.
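A minimal sketch of that approach, assuming a single problematic column named col1 and a placeholder file path (neither comes from the original post): read the Parquet file with an all-string schema, then cast to the intended decimal type.

from pyspark.sql import SparkSession
import pyspark.sql.types as t
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Declare the problematic column as StringType so the Parquet reader
# does not try to convert the BINARY values to decimal while scanning
string_schema = t.StructType([t.StructField('col1', t.StringType(), True)])

df = spark.read.schema(string_schema).parquet('/path/to/parquet')  # placeholder path

# Enforce the intended type once the raw string values have been checked
df = df.withColumn('col1', f.col('col1').cast(t.DecimalType(16, 2)))
df.printSchema()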

Please add your code and point out which line throws the exception.

Hi, I have updated my post. Is that what you were looking for? Thanks. We found that increasing the precision to decimal(38,2) works; we have other columns that use this data type and have not seen any issues with them. How do we specify an external schema in Spark when reading a Parquet file?

You first have to create a compatible schema with StructType, e.g. schema = t.StructType([t.StructField('id', StringType(), True), t.StructField('name', StringType(), True)]), and then use it when reading the DataFrame: df = spark.read.schema(schema).parquet(file_path)

Got it, but in my case I should not really have to do that, because the Parquet files I am reading contain the same columns in different data types. I could start using the string fields, but I wanted to know whether there is a way to read the fields with their actual data types. The team producing these files writes them like this: col1: decimal(16,2) (nullable = true), col1_std: string (nullable = true). I will use the string column and cast it when loading the target, as sketched below. Thanks again for your help.
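A hedged sketch of that last approach, using the column names described in the comment (col1 / col1_std) and a placeholder file path: select the string variant and cast it back to decimal(16,2) while loading the target.

import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet('/path/to/parquet')  # placeholder path

# Use the string variant written by the producing team and cast it back to
# the target decimal type; only col1_std needs to be read from the files
target = df.select(f.col('col1_std').cast(t.DecimalType(16, 2)).alias('col1'))
target.printSchema()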