Spark is ignoring the bucketing settings of a Hive table


apache-spark


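I am working with a 1 TB dataset on S3. The data is stored in Parquet files. After running my code, many files were created in each partition, but not the expected number (6).

When I try to query the table from Presto, it throws the following exception: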

Query 20180820_074141_00004_46w5b failed: Hive table 'db.test_orc_opt_1' is corrupt. The number of files in the directory (13) does not match the declared bucket count (6) for partition: departure_date_year_month_int=201208

Is there a way to force Spark to honor the bucketing?

Spark version: 2.3.1
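The check behind the Presto error can be reproduced offline to see which partitions are affected. The following is only a rough sketch: it assumes the table directory has been copied or mounted to a local path, the function names and the demo layout are hypothetical, and skipping names that start with `_` or `.` is an assumption of this sketch rather than a statement about what Presto does.

```python
import os
import tempfile


def count_data_files(partition_dir):
    """Count files in one partition directory, skipping marker files such as _SUCCESS."""
    return sum(
        1
        for name in os.listdir(partition_dir)
        if os.path.isfile(os.path.join(partition_dir, name))
        and not name.startswith(("_", "."))
    )


def find_mismatched_partitions(table_dir, declared_buckets):
    """Return {partition_name: file_count} for partitions whose count differs."""
    mismatched = {}
    for name in sorted(os.listdir(table_dir)):
        partition_dir = os.path.join(table_dir, name)
        if os.path.isdir(partition_dir):
            n_files = count_data_files(partition_dir)
            if n_files != declared_buckets:
                mismatched[name] = n_files
    return mismatched


def build_demo_table(root):
    """Create a hypothetical layout: one partition with 13 files, one with 6."""
    layout = [
        ("departure_date_year_month_int=201208", 13),
        ("departure_date_year_month_int=201209", 6),
    ]
    for partition, n_files in layout:
        path = os.path.join(root, partition)
        os.makedirs(path)
        for i in range(n_files):
            # Empty placeholder files; the extension is illustrative only.
            open(os.path.join(path, "part-%05d.orc" % i), "w").close()


with tempfile.TemporaryDirectory() as root:
    build_demo_table(root)
    result = find_mismatched_partitions(root, declared_buckets=6)
    print(result)
```

In the demo layout only the 201208 partition is reported, since it holds 13 files against a declared bucket count of 6.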

Try changing

.bucketBy(6, "departure_date_year")

to

.bucketBy(13, "departure_date_year")

Which version of Spark are you using? Spark bucketing is different from Hive bucketing. Insert into the table with Hive instead of Spark.

See page 42.

Please check. What is the schema of the original Parquet files?
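To illustrate why a Spark write can leave more files per partition than the declared bucket count, here is a small pure-Python model. It is a sketch resting on one assumption about Spark's behavior: each write task emits its own file for every bucket it holds rows for, so the file count can reach tasks × buckets. The function names are hypothetical, and the hash is a stand-in, not Spark's Murmur3 or Hive's hash function.

```python
import random


def simulate_unaligned_write(keys, num_tasks, num_buckets, seed=0):
    """Rows land on arbitrary tasks; each (task, bucket) pair becomes one file."""
    rng = random.Random(seed)
    files = set()
    for key in keys:
        task = rng.randrange(num_tasks)      # arbitrary task assignment
        bucket = hash(key) % num_buckets     # stand-in hash, NOT Murmur3
        files.add((task, bucket))
    return files


def simulate_aligned_write(keys, num_buckets):
    """Rows are first routed to tasks by bucket id, so each bucket has one writer."""
    files = set()
    for key in keys:
        bucket = hash(key) % num_buckets
        task = bucket                        # one task per bucket
        files.add((task, bucket))
    return files


keys = list(range(10_000))
unaligned = simulate_unaligned_write(keys, num_tasks=4, num_buckets=6)
aligned = simulate_aligned_write(keys, num_buckets=6)
print(len(unaligned), "files without alignment (upper bound 4 * 6 = 24)")
print(len(aligned), "files with one task per bucket")
```

In Spark terms, the aligned case would correspond to repartitioning by the bucket column before the write, for example `df.repartition(6, "departure_date_year")`. Whether that yields exactly 6 files per partition in this particular job is an assumption that would need checking, and it would not change the fact that Spark and Hive hash rows into buckets differently.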