Apache Spark on EMR 5.28 unable to load Parquet files from S3

Tags: apache-spark, apache-spark-sql, amazon-emr, parquet

On an EMR 5.28.0 cluster, reading Parquet files from S3 fails with the exception below, while the same job works fine on EMR 5.18.0. Here is the stack trace from EMR 5.28.0.

I even tried it from spark-shell:

val df = sqlContext.read.load("s3://s3_file_path/*")
df.take(5)
but it fails with the same error:

Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 17, ip-x.x.x.x.ec2.internal, executor 1): org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://somedir/somesubdir/434560/1658_1564419581.parquet, range: 0-7928, partition values: [empty row], isDataPresent: false
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:241)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:171)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.org$apache$spark$sql$execution$datasources$parquet$ParquetFileFormat$$isCreatedByParquetMr(ParquetFileFormat.scala:352)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:676)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:579)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:73)
    at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more
I couldn't find any documentation on this. Has anyone run into this issue on EMR 5.28.0 and managed to resolve it?

On 5.28 I can read files that were written to S3 by EMR itself, but reading the existing Parquet files written by parquet-go throws the exception above, while on EMR 5.18 they read fine.

Update: Inspecting the Parquet files shows that the older files, which work only on 5.18, are missing statistics:

creator:            null 
file schema:        parquet-go-root 
timestringhr:        BINARY SNAPPY DO:0 FPO:21015 SZ:1949/25676/13.17 VC:1092 ENC:RLE,BIT_PACKED,PLAIN ST:[no stats for this column]
timeseconds:         INT64 SNAPPY DO:0 FPO:22964 SZ:1397/9064/6.49 VC:1092 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 1564419460, max: 1564419581, num_nulls not defined]
whereas the files that work on both EMR 5.18 and 5.28 show:

creator:            parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:              org.apache.spark.sql.parquet.row.metadata = {<schema_here>}    
timestringhr:        BINARY SNAPPY DO:0 FPO:3988 SZ:156/152/0.97 VC:1092 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 2019-07-29 16:00:00, max: 2019-07-29 16:00:00, num_nulls: 0]
timeseconds:         INT64 SNAPPY DO:0 FPO:4144 SZ:954/1424/1.49 VC:1092 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 1564419460, max: 1564419581, num_nulls: 0]

This may be what causes the NullPointerException. I found a related issue filed against parquet-mr. I can try including a newer version of Parquet on the classpath, or test on the EMR 6 beta, to see whether that resolves the issue.
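To see which files are affected, here is a minimal sketch (not from the original post) that reads a footer back with xitongsys/parquet-go and prints the created_by value Spark trips over. The file name is a placeholder, and passing nil as the object to NewParquetReader is assumed to pick the schema up from the file itself:

package main

import (
    "fmt"
    "log"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/reader"
)

func main() {
    // Placeholder name; point this at a file written by parquet-go.
    fr, err := local.NewLocalFileReader("1658_1564419581.parquet")
    if err != nil {
        log.Fatal(err)
    }
    defer fr.Close()

    // nil object: read the schema from the file's own footer (assumption).
    pr, err := reader.NewParquetReader(fr, nil, 1)
    if err != nil {
        log.Fatal(err)
    }
    defer pr.ReadStop()

    // Older parquet-go output leaves created_by unset; that null is what
    // Spark's isCreatedByParquetMr check dereferences on EMR 5.28.
    if pr.Footer.CreatedBy == nil {
        fmt.Println("created_by: <missing>")
    } else {
        fmt.Println("created_by:", *pr.Footer.CreatedBy)
    }
}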

Try adding a created_by value to the footer. I tracked an NPE down to the missing created_by footer value, via this check in Spark (the isCreatedByParquetMr check in ParquetFileFormat.scala, visible in the stack trace above). If you're using xitongsys/parquet-go, consider the following:

// Any non-nil string avoids the null created_by that triggers the NPE.
var writer_version = "parquet-go version 1.0"
...
pw, err := writer.NewJSONWriter(schemaStr, fw, 4)
// check err, then set the footer's created_by before writing any rows
pw.Footer.CreatedBy = &writer_version
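For context, here is a self-contained sketch of the same fix, assuming xitongsys/parquet-go and its parquet-go-source companion package. The schema, output file name, and sample record are invented for illustration, and the schema tag syntax varies between parquet-go versions:

package main

import (
    "log"

    "github.com/xitongsys/parquet-go-source/local"
    "github.com/xitongsys/parquet-go/writer"
)

func main() {
    // Invented two-column schema mirroring the question's files.
    schemaStr := `{
        "Tag": "name=parquet-go-root",
        "Fields": [
            {"Tag": "name=timestringhr, type=BYTE_ARRAY, convertedtype=UTF8"},
            {"Tag": "name=timeseconds, type=INT64"}
        ]
    }`

    fw, err := local.NewLocalFileWriter("out.parquet")
    if err != nil {
        log.Fatal(err)
    }
    defer fw.Close()

    pw, err := writer.NewJSONWriter(schemaStr, fw, 4)
    if err != nil {
        log.Fatal(err)
    }

    // The fix itself: a non-nil created_by gives Spark's parquet-mr
    // probe something to inspect instead of a null.
    writerVersion := "parquet-go version 1.0"
    pw.Footer.CreatedBy = &writerVersion

    if err := pw.Write(`{"timestringhr": "2019-07-29 16:00:00", "timeseconds": 1564419460}`); err != nil {
        log.Fatal(err)
    }
    if err := pw.WriteStop(); err != nil {
        log.Fatal(err)
    }
}

Written this way, the file's footer should report creator: parquet-go version 1.0 instead of creator: null, matching the shape of the working parquet-mr files above.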

I ran into a similar issue after upgrading from emr-5.20.0 to emr-5.29.0. I just want to confirm that adding pw.Footer.CreatedBy in the Go code works.