Saving Hive query results as a Parquet file (Hive, Parquet, Snappy)

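I am trying to create a snappy.parquet file from a Hive table. It is a large partitioned table, and I only need a small part of it. This is what I am doing: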

set parquet.compression=SNAPPY;
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;
set mapred.output.compress=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapred.compress.map.output=true;
set mapreduce.map.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
set io.seqfile.compression.type = BLOCK;
insert overwrite directory 'EXTERNAL_DIRECTORY' STORED AS PARQUET select * from SOURCE_TABLE;

It creates a 000000_0 file with the following schema:

message hive_schema {
optional int32 _col0;
optional binary _col1 (UTF8);
optional binary _col2 (UTF8);
optional binary _col3 (UTF8);
optional binary _col4 (UTF8);
optional binary _col5 (UTF8);
optional binary _col6 (UTF8);
optional binary _col7 (UTF8);
optional int64 _col8;
optional int64 _col9;
optional int64 _col10;
}

All of the column names from SOURCE_TABLE are lost. How do I save the data correctly so that it can later be used as a Hive table?

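I would create a new external table for your dataset by selecting all the data from the source partition you are after. You will then have both a table and files that you can work with. As of now, you cannot use a CREATE TABLE AS SELECT statement with external tables, so you need to create the table first and then load the data into it.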

create external table yourNewTable (
  -- copy the column definitions from your source table's DDL here ...
)
  stored as parquet location '/yourNewLocation';

insert into yourNewTable
  select * from yourSourceTable where yourPartitionedFieldNames = 'desiredPartitionName';
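A slightly fuller sketch of the two statements above, under stated assumptions: the table names, the three columns, and the partition column `dt` are hypothetical placeholders, not taken from the question. Compression is requested through the `parquet.compression` table property instead of a session-level setting.

```sql
-- Hypothetical names throughout: new_table, source_table, the columns, and dt.
CREATE EXTERNAL TABLE new_table (
  id      INT,
  name    STRING,
  amount  BIGINT
)
STORED AS PARQUET
LOCATION '/yourNewLocation'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

-- List the columns explicitly so they line up with the new table's definition.
INSERT INTO TABLE new_table
SELECT id, name, amount
FROM source_table
WHERE dt = 'desiredPartitionName';
```

Because the data is written through a table definition, the Parquet files should carry the declared column names instead of the generated `_col0`, `_col1`, ... names shown in the question.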

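So it sounds like you are trying to create a new dataset that takes data from just one partition of a larger source dataset, right? If so, I would create a new external Hive table and select into it all the data from the specific partition you are after. You will then have a table and a directory/files to work with. I would first try creating the new Hive table and selecting into it.

I get the following error: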

Error while executing Hive query: Error while compiling statement: FAILED: SemanticException [Error 10071]: Inserting into an external table is not allowed

I get this error when trying INSERT OVERWRITE TABLE. Am I doing something wrong?

Make sure the property hive.insert.into.external.tables=true is set.
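As a minimal sketch of that last suggestion, the property can be set in the same session before the insert is retried. This assumes the session is allowed to change the property; if it is not, the setting would have to be changed in the server configuration instead. The table and column names are the placeholders used in the answer above.

```sql
-- Allow inserts into external tables for this session
-- (assumes the property may be changed at session level).
SET hive.insert.into.external.tables=true;

-- Retry the load, using the placeholder names from the answer.
INSERT OVERWRITE TABLE yourNewTable
SELECT * FROM yourSourceTable
WHERE yourPartitionedFieldNames = 'desiredPartitionName';
```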