Hadoop: Selecting distinct values from each column in Hive

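I am trying to select the distinct values from each column of a given table. My query performs poorly because it spawns many MapReduce jobs, and I am looking for a better solution.

My table contains the following values: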

last_30: a  
last_90: a, b, a    
last_180: b, c

The desired output is:

last_30#a  
last_90#a  
last_90#b   
last_180#b  
last_180#c

The following query gives me the desired output, but it performs poorly because it scans the table several times:

    SELECT distinct concat('last_30#', exploded_last_30.key)
    FROM table
    LATERAL VIEW explode(last_30) exploded_last_30 AS key
    UNION ALL
    SELECT distinct concat('last_90#', exploded_last_90.key)
    FROM table
    LATERAL VIEW explode(last_90) exploded_last_90 AS key
    UNION ALL
    SELECT distinct concat('last_180#', exploded_last_180.key)
    FROM table
    LATERAL VIEW explode(last_180) exploded_last_180 AS key

Can you think of a faster way to produce the desired output?

Regards

::: UPDATE :::

Using your solution, I came up with the following query:

    select distinct *
    from (
        select explode( map_keys( map(
                                      concat('firstname#',a.exploded_firstname), '1', 
                                      concat('lastname#', a.exploded_lastname), '1', 
                                      concat('gender#', a.exploded_gender), '1',
                                      concat('last_30#', a.exploded_last_30), '1',
                                      concat('last_90#', a.exploded_last_90), '1'   
                                     ) 
                                )  
                      )
        from (
              select
                exploded_firstname.key as exploded_firstname, 
                exploded_lastname.key as exploded_lastname, 
                exploded_gender.key as exploded_gender,
                exploded_last_30.key as exploded_last_30,
                exploded_last_90.key as exploded_last_90
              from table
              LATERAL VIEW explode(firstname) exploded_firstname AS key, value
              LATERAL VIEW explode(lastname) exploded_lastname AS key, value
              LATERAL VIEW explode(gender) exploded_gender AS key, value
              LATERAL VIEW explode(last_30) exploded_last_30 AS key
              LATERAL VIEW explode(last_90) exploded_last_90 AS key
          ) as a 
      ) as b;

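But I still face two problems:

First, I did not describe the problem well enough: the sample data I provided contained only primitive data types. The real table also contains maps and arrays. As soon as a single array or map contains a NULL value, the query returns no output at all.

Second, adding more fields to this query holds up the compiler before it creates the MapReduce job that executes the request. Here are the MapReduce times for 14 and 15 fields, respectively: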

Total MapReduce CPU Time Spent: 26 seconds 60 msec
OK
Time taken: 142.896 seconds

Total MapReduce CPU Time Spent: 29 seconds 310 msec
OK
Time taken: 257.807 seconds

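As you can see, the total MapReduce CPU time grows roughly linearly, while the total elapsed time increases dramatically. Do you have any suggestions for these two problems?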

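UNION forces the table to be read multiple times. To avoid that, you can use a map to deduplicate the values within a single row and then explode it (which also pivots your data). For the deduplication, use the column values as map keys and a constant as the map value.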
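To see the per-row deduplication in isolation, here is a minimal sketch with made-up literals. It assumes a Hive version that accepts `SELECT` without a `FROM` clause. Repeated keys passed to `map()` collapse into one entry, so `map_keys()` returns each value once.

    -- Hypothetical literals: 'a' is passed as a key twice, '1' is a throwaway value.
    -- A map holds each key only once, so map_keys() returns 'a' and 'b' once each.
    SELECT map_keys(map('a', '1', 'b', '1', 'a', '1')) AS distinct_values;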

If no values are duplicated across rows, the whole query is a single scan of the table:

  select explode( map_keys( map(concat(customer_id, '#', customer_fname), '1'
             , concat(customer_id, '#', customer_lname), '1'
             , concat(customer_id, '#', customer_email), '1'
             , concat(customer_id, '#', customer_street), '1'
             , concat(customer_id, '#', customer_city), '1'
             , concat(customer_id, '#', customer_state), '1'
             , concat(customer_id, '#', customer_zipcode), '1'
      ) ) ) from customers

If different rows produce duplicates, add DISTINCT, but that forces a reduce phase and will be slower.
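A sketch of that DISTINCT variant, reusing the `customers` table from the query above and shortened to three columns. The aliases `kv` and `t` are names made up for the example:

    -- Inner query: one scan, per-row deduplication via the map keys.
    -- Outer DISTINCT: removes duplicates across rows, at the cost of a reduce phase.
    SELECT DISTINCT t.kv
    FROM (
        SELECT explode(map_keys(map(
                   concat(customer_id, '#', customer_fname), '1',
                   concat(customer_id, '#', customer_lname), '1',
                   concat(customer_id, '#', customer_email), '1'
               ))) AS kv
        FROM customers
    ) t;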

Also, maps can be used to pivot data :D
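For array columns such as `last_30`, `last_90` and `last_180` from the question, the pivot idea might look like the sketch below. It is untested against the real table and assumes all three columns are `array<string>`; `my_table` and the aliases are placeholder names. The map pairs each column name with its array, the first `LATERAL VIEW` produces one row per column, and the second explodes each array separately. That way the arrays are not cross-joined with each other (which is what chaining one `LATERAL VIEW explode` per column does), and a NULL or empty array only drops that one column for the row instead of the whole row.

    SELECT DISTINCT concat(cols.col_name, '#', vals.val) AS kv
    FROM my_table
    -- Pivot: one row per (column name, array) pair.
    LATERAL VIEW explode(map(
        'last_30',  last_30,
        'last_90',  last_90,
        'last_180', last_180
    )) cols AS col_name, col_values
    -- Explode each array on its own; a NULL or empty array yields no rows for that column only.
    LATERAL VIEW explode(cols.col_values) vals AS val
    -- Skip NULL elements, since concat() with a NULL argument returns NULL.
    WHERE vals.val IS NOT NULL;

The `map()` call needs all values to have the same type, so map-typed columns such as `firstname` in the update would need their own pivot (for example over `map_keys(firstname)`) rather than being mixed in with the arrays.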