Apache Spark: Hive external table cannot see partitioned Parquet files


I am using Spark to generate Parquet files (partitioned by setid, with Snappy compression) and storing them at an HDFS location:

df.coalesce(1).write.partitionBy("SetId").
  mode(SaveMode.Overwrite).
  format("parquet").
  option("header","true").
  save(args(1))

The Parquet data files are stored under /some-hdfs-path/testsp

I then create a Hive table over it as follows:

CREATE EXTERNAL TABLE DimCompany(
  CompanyCode string,
  CompanyShortName string,
  CompanyDescription string,
  BusinessDate string,
  PeriodTypeInd string,
  IrisDuplicateFlag int,
  GenTimestamp timestamp
) partitioned by (SetId int)
STORED AS PARQUET LOCATION '/some-hdfs-path/testsp'
TBLPROPERTIES ('skip.header.line.count'='1','parquet.compress'='snappy');

However, when I run a SELECT against the table in Hive, it returns no results.

I have tried:

Running the msck command, as in:

msck repair table dimcompany;

Setting the following:

spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")

Neither of these worked. How can I fix this?

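The problem is that the partition column SetId uses uppercase letters.

Because Hive converts its column names to lowercase, the partition column is stored as setid rather than SetId. So when Hive searches a case-sensitive data store for partitions/folders, it looks for setid=some_value and finds nothing, because your data folders are named SetId=some_value.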
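The mismatch described above can be illustrated with a small sketch in plain Scala (no Spark or Hive required). The object and function names here are hypothetical, and the two functions only model the behavior described in this answer: the directory keeps the column name as written, while Hive looks it up in lowercase.

```scala
object PartitionCaseDemo {
  // Models the directory name written for a partition:
  // the column name is kept exactly as it appears in the DataFrame.
  def writtenDir(column: String, value: Int): String = s"$column=$value"

  // Models the directory name Hive looks for:
  // the column name as stored in the metastore, i.e. lowercased.
  def expectedByHive(column: String, value: Int): String =
    s"${column.toLowerCase}=$value"

  def main(args: Array[String]): Unit = {
    val written = writtenDir("SetId", 1)
    val expected = expectedByHive("SetId", 1)
    println(s"written by Spark:  $written")
    println(s"expected by Hive:  $expected")
    // On a case-sensitive file system these are two different paths.
    println(s"same path: ${written == expected}")
  }
}
```

On a case-sensitive store such as HDFS the two strings name different directories, which is consistent with the empty result set.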

To make this work, convert SetId to lowercase or snake_case. You can do this by aliasing the column in your DataFrame:

df.select(
... {{ your other_columns }} ...,
col("SetId").alias("set_id")
)
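If many columns need renaming, aliasing each one by hand gets tedious. The following is a minimal sketch of a name-normalizing helper; ColumnNames and toSnakeCase are hypothetical names, and the regex assumes simple CamelCase names like the ones in the question.

```scala
object ColumnNames {
  // Insert "_" at every lowercase/digit-to-uppercase boundary, then lowercase.
  // Example: "SetId" becomes "set_id".
  def toSnakeCase(name: String): String =
    name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase

  def main(args: Array[String]): Unit = {
    val columns = Seq("CompanyCode", "BusinessDate", "SetId")
    columns.foreach(c => println(s"$c -> ${toSnakeCase(c)}"))
    // Assumption: with Spark on the classpath and a DataFrame named df,
    // the same function could rename every column before writing:
    //   val renamed = df.toDF(df.columns.map(ColumnNames.toSnakeCase): _*)
  }
}
```

Whichever name you settle on, use it both in partitionBy and in the PARTITIONED BY clause of the Hive table. For example, if the data is written with set_id, a table declared with SetId (stored by Hive as setid) would still not match.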

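You may also have to set some properties before executing the CREATE statement; the original answer referred to a linked post for the specifics.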

After creating the table, you can also try running:

msck repair table <your_schema.your_table>;


I ran into this problem myself; credit for the root cause goes to Anuvrat Singh. Regarding

msck repair table <your_schema.your_table>;