Scala: Parquet files written from a Spark DataFrame appear to be corrupted
I use Spark to read the data that AWS Kinesis outputs once per hour, following Kinesis's hourly partitioning, and write it out as Parquet files. On write, I partition the output by year/month/day/hour/eventType, then append and save to S3:
fooDf
.withColumn("timestamp_new", (col("timestamp").cast("timestamp")))
.drop("timestamp")
.withColumnRenamed("timestamp_new", "timestamp")
.withColumn("year", year(col("timestamp")))
.withColumn("month", month(col("timestamp")))
.withColumn("day", dayofmonth(col("timestamp")))
.withColumn("hour", hour(col("timestamp")))
.write
.option("mode", "DROPMALFORMED")
.mode("overwrite")
.partitionBy("year", "month", "day", "hour", "eventType")
.parquet("s3://foo/bar/foobar")
The problem appears at read time, though: I get incompatible data types, even though Parquet is supposed to handle schema evolution. The error is:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
at org.apache.spark.sql.Dataset.show(Dataset.scala:745)
at org.apache.spark.sql.Dataset.show(Dataset.scala:704)
... 85 elided
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:193)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://foo/bar/foobar/year=2019/month=9/day=5/hour=22/eventType=barbarbar/part-rawr-c000.snappy.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
... 22 more
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetStringConverter"
at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:34)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
... 26 more
This is a common problem: when reading, Spark cannot infer the data type of the partition column eventType (e.g. event=Barbar). Either set the following, via spark-submit or in code, before reading the files:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

or read with an explicit schema.
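Both suggestions can be sketched as follows. This is a minimal sketch, not a verified fix for the question's error; the column names in the explicit schema are assumptions reconstructed from the write snippet above, and the S3 path is the placeholder from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("parquet-read-fix").getOrCreate()

// Option 1: disable partition-column type inference before reading,
// so Spark treats year=/month=/... values as plain strings instead of
// guessing (possibly conflicting) types per partition.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val inferred = spark.read.parquet("s3://foo/bar/foobar")

// Option 2: pin the schema explicitly on read. The data columns below
// are an assumption; replace them with the real columns of fooDf.
val schema = StructType(Seq(
  StructField("timestamp", TimestampType),
  StructField("year", IntegerType),
  StructField("month", IntegerType),
  StructField("day", IntegerType),
  StructField("hour", IntegerType),
  StructField("eventType", StringType)
))
val pinned = spark.read.schema(schema).parquet("s3://foo/bar/foobar")
```

With an explicit schema, Spark skips both schema inference over the footers and partition-value type inference, which removes one common source of "Parquet column cannot be converted" failures.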
I tried this, but I still get the same error: Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetStringConverter"

Comments:

@EricMeadows - why does the path in the stack trace not match your partitioning? You partition by year, month, day, hour, eventType, but the failing path is s3://foo/bar/foobar/year=2019/month=9/day=5/hour=22/event=barbarbar/part-rawr-c000.snappy.parquet, i.e. event= rather than eventType=. If the partition column name were mangled, how could the Spark DataFrame reader even pick it up from the higher-level S3 prefix?

On write I parse the Kinesis stream and use the column eventType; it is not corrupted and reads into a DataFrame correctly, and I then partition by it. When I inspect the failing file, it looks fine and not corrupted.

Why is there an S3 prefix event= at all? If you partition by eventType, the prefix should be eventType=!

I changed some of the output because of sensitive information; I accidentally used event and eventType interchangeably.

Try loading just the file the error complains about, i.e. s3://foo/bar/foobar/year=2019/month=9/day=5/hour=22/event=barbarbar/part-rawr-c000.snappy.parquet, and see if the data loads. It looks like that file is corrupted.

@Prateek - it loads fine :(

What is the save mode at the source? I suspect a data-type mismatch somewhere, either in the failing location or in other files. It will be a tedious process, but you will have to load several files and compare their schema types.

@Prateek - when I load all the data at once, rather than partition by partition, and write it, it works fine. When I load each folder and write it to the same destination with mode append, it fails.

@EricMeadows did you happen to solve this?
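The tedious per-file schema check suggested in the comments can be scripted rather than done by hand. This is a sketch under assumptions: the root path is the question's placeholder, and it assumes your Hadoop configuration can list the s3:// bucket; it reads each Parquet file's schema separately and reports any file whose schema differs from the first one seen.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().getOrCreate()

val root = new Path("s3://foo/bar/foobar")
val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Walk the partition tree and diff each leaf file's schema against the
// first one encountered; any mismatch printed here is a candidate cause
// of "Parquet column cannot be converted".
val files = fs.listFiles(root, /* recursive = */ true)
var reference: Option[StructType] = None
while (files.hasNext) {
  val f = files.next().getPath
  if (f.getName.endsWith(".parquet")) {
    val s = spark.read.parquet(f.toString).schema
    reference match {
      case None => reference = Some(s)
      case Some(r) if r != s =>
        println(s"Schema mismatch in $f:\n  expected $r\n  found    $s")
      case _ => () // schema matches the reference
    }
  }
}
```

If one hourly append wrote a column as, say, a string while another wrote it as a struct, this loop will surface the offending file, which is exactly the kind of mismatch the group-converter ClassCastException points at.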