SQL: Hive query logic and optimization

sql hadoop hive


I have data in the following format:

Input:

Output:

ID    col1       col2      col3
1     C1_abc     C1_xce    C1_fde
2     C1_sds     C1_hhh    null
3     C1_aaa     null      null
4     C1_asw     C1_eee    C1_ttt

I want to achieve this with a Hive script. I know several ways to do it, but I need the most efficient one because the data volume is large.

Just use conditional aggregation:

select id,
       max(case when rank = 1 then col1 end) as col1,
       max(case when rank = 2 then col1 end) as col2,
       max(case when rank = 3 then col1 end) as col3
from t
where rank in (1, 2, 3)
group by id;
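
To make the pivot concrete, here is a minimal sketch that runs the same aggregation over inline sample rows. The input layout (one row per `id` and `rank`, with the value in `col1`) is an assumption inferred from the query and from the output in the question, since the original input listing is not shown; the CTE named `t` simply stands in for the real table.

```sql
-- Hypothetical sample rows: the (id, rank, col1) layout is assumed, not taken from the question.
with t as (
    select 1 as id, 1 as rank, 'C1_abc' as col1 union all
    select 1 as id, 2 as rank, 'C1_xce' as col1 union all
    select 1 as id, 3 as rank, 'C1_fde' as col1 union all
    select 2 as id, 1 as rank, 'C1_sds' as col1 union all
    select 2 as id, 2 as rank, 'C1_hhh' as col1 union all
    select 3 as id, 1 as rank, 'C1_aaa' as col1 union all
    select 4 as id, 1 as rank, 'C1_asw' as col1 union all
    select 4 as id, 2 as rank, 'C1_eee' as col1 union all
    select 4 as id, 3 as rank, 'C1_ttt' as col1
)
-- Each case expression keeps col1 only for its own rank;
-- max() then collapses the rows of one id into a single row.
select id,
       max(case when rank = 1 then col1 end) as col1,
       max(case when rank = 2 then col1 end) as col2,
       max(case when rank = 3 then col1 end) as col3
from t
where rank in (1, 2, 3)
group by id;
```

With these sample rows the result should be the four rows shown in the question's output (row order is not guaranteed without an `order by`). An `id` that has no row for a given rank gets null in that column, because `max` over only nulls is null.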

Another option is a multi-way join:

select t1.id, t1.col1, t2.col1 as col2, t3.col1 as col3
from t t1 left join
     t t2
     on t1.id = t2.id and t2.rank = 2 left join
     t t3
     on t1.id = t3.id and t3.rank = 3
-- filter the driving table in WHERE; inside a left join's ON clause it would not remove rows
where t1.rank = 1;

You may want to try both approaches to see which one runs faster. It can vary depending on your data.
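
One way to compare the two before running them on the full data set is to look at their plans with Hive's `EXPLAIN`, which prints the stage plan without executing the query. The sketch below assumes the same table `t` with columns `id`, `rank`, and `col1`; what the plans actually look like depends on the Hive version, execution engine, and settings.

```sql
-- Plan for the conditional-aggregation version
explain
select id,
       max(case when rank = 1 then col1 end) as col1,
       max(case when rank = 2 then col1 end) as col2,
       max(case when rank = 3 then col1 end) as col3
from t
where rank in (1, 2, 3)
group by id;

-- Plan for the self-join version
explain
select t1.id, t1.col1, t2.col1 as col2, t3.col1 as col3
from t t1 left join
     t t2
     on t1.id = t2.id and t2.rank = 2 left join
     t t3
     on t1.id = t3.id and t3.rank = 3
where t1.rank = 1;
```

As a rough guide, the aggregation version would typically read `t` once, while the join version reads it once per alias, so counting the table scans in each plan is a reasonable first check.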

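The first option works like a charm. Before asking this question I had been using the second option, but it was not optimized and took 40 minutes to complete. This one finished within 60 seconds. Thank you, Gordon.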
