Hive "Parquet record is malformed" while column count is not 0


On an AWS EMR cluster, I'm trying to write query results to Parquet using PySpark, but I'm hitting the following error:

Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
    at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
    ... 10 more
I've read that this can happen if some columns contain only NULL values, but after checking the counts for every column that isn't the case here; no column is entirely empty. When I write the same results to a text file instead of Parquet, everything works fine.
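
For reference, a minimal PySpark sketch of the kind of per-column check described above (assuming the query result is held in a DataFrame called df; the names here are placeholders, not the actual job's code):

from pyspark.sql import functions as F

# count() ignores NULLs, so a count of 0 means the column is entirely NULL
non_null_counts = df.select(
    [F.count(c).alias(c) for c in df.columns]
).collect()[0].asDict()

all_null_columns = [c for c, n in non_null_counts.items() if n == 0]
print(all_null_columns)  # expected to be empty if no column is all-NULL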

Any idea what could trigger this error? Below are all the data types used in this table; there are 51 columns in total.

'array<bigint>',
'array<char(50)>',
'array<smallint>',
'array<string>',
'array<varchar(100)>',
'array<varchar(50)>',
'bigint',
'char(16)',
'char(20)',
'char(4)',
'int',
'string',
'timestamp',
'varchar(255)',
'varchar(50)',
'varchar(87)'

It turns out that Parquet does not support empty arrays. If the table contains one or more empty arrays (of any type), this error is triggered.


One workaround is to cast the empty arrays to NULL.
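
For what it's worth, a minimal sketch of that workaround in PySpark, assuming the data sits in a DataFrame df before it is written (the df variable and everything around it are placeholders):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

# find every array-typed column in the schema
array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

for c in array_cols:
    # replace empty arrays with NULL so the Parquet writer never sees []
    df = df.withColumn(
        c,
        F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c)),
    )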

It looks like you're hitting one of Spark's Hive write paths (org.apache.hadoop.hive.ql.io.parquet.write). I was able to get around this by writing the Parquet files directly and then adding the partition to whatever Hive table needed it:

df.write.parquet(your_path)
spark.sql(f"""
    ALTER TABLE {your_table}
    ADD PARTITION (partition_spec) LOCATION '{your_path}'
    """)

It looks like you have empty arrays ([]). If a column contains a mix of null values and [] values, could you try replacing the empty arrays with null and see whether the column then shows up as empty?

That might make sense, I'll try it, and I'll also make sure that the upstream job producing the Parquet and the current job reading it are on the same Parquet version. @shuvalov this is the correct answer! If you can control the file format, another workaround is to use a different format instead of Parquet, one that handles empty arrays fine; it is equally well supported by all the big data tools.
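
For completeness, a hedged sketch (again assuming a placeholder DataFrame df holds the data before writing) of how one might confirm which array columns actually contain empty arrays before applying the replacement suggested above:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

# count rows where each array column is empty (size == 0); NULL arrays are ignored
empty_counts = df.select(
    [F.sum((F.size(F.col(c)) == 0).cast("int")).alias(c) for c in array_cols]
).collect()[0].asDict()

print({c: n for c, n in empty_counts.items() if n})  # columns that will trip the writer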