Hadoop: how to transpose/pivot data in Hive?


I know there is no direct way to transpose data in Hive. I went through this question: , but since there was no final answer there, I could not get all the way.

Here is my table:

 | ID   |   Code   |  Proc1   |   Proc2 | 
 | 1    |    A     |   p      |   e     | 
 | 2    |    B     |   q      |   f     |
 | 3    |    B     |   p      |   f     |
 | 3    |    B     |   q      |   h     |
 | 3    |    B     |   r      |   j     |
 | 3    |    C     |   t      |   k     |
Here Proc1 can have any number of values. ID, Code and Proc1 together form the unique key for this table. I want to pivot/transpose this table so that each unique value in Proc1 becomes a new column, and the corresponding value from Proc2 becomes the value in that column for the matching row. Essentially, I'm trying to get something like this:

 | ID   |   Code   |  p   |   q |  r  |   t |
 | 1    |    A     |   e  |     |     |     |
 | 2    |    B     |      |   f |     |     |
 | 3    |    B     |   f  |   h |  j  |     |
 | 3    |    C     |      |     |     |  k  |
In the newly transposed table, ID and Code form the unique primary key. From the ticket I mentioned above, I managed to get this far using the to_map UDAF. (Disclaimer - this may not be a step in the right direction, but I'm mentioning it here in case it is.)
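Roughly, that intermediate step looks like the sketch below (a sketch only - the table name test_sample is taken from the accepted answer further down, and to_map is assumed to be a UDAF that aggregates key/value pairs into a map):

select id, code, to_map(proc1, proc2) as proc_map   -- e.g. {"p":"f","q":"h","r":"j"} for id = 3, code = B
from test_sample
group by id, code;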

But I don't know how to go from this step to the pivoted/transposed table I want. Any help on how to proceed would be great!
Thanks.

I haven't written this code, but I think you can use some of the UDFs provided by Klout's Brickhouse:

Specifically, you could use their collect, as described here:


and then explode the arrays (which will be of differing lengths) using the approach detailed in this post.
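I have not tested it, but a minimal sketch of that collect-and-explode round trip could look like the following. It uses the map form of collect and Hive's built-in explode instead of the arrays-of-different-lengths UDTF from the linked post, and assumes the question's table is called test_sample:

add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

-- collect each group's (proc1, proc2) pairs into a map, then explode the map back into rows
SELECT g.id, g.code, kv.proc1, kv.proc2
FROM (
    SELECT id, code, collect(proc1, proc2) AS proc_map
    FROM test_sample
    GROUP BY id, code
) g
LATERAL VIEW explode(g.proc_map) kv AS proc1, proc2;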

Here is the solution I ended up using:

add jar brickhouse-0.7.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';

select
    id,
    code,
    group_map['p'] as p,
    group_map['q'] as q,
    group_map['r'] as r,
    group_map['t'] as t
from (
    select
        id, code,
        collect(proc1, proc2) as group_map
    from test_sample
    group by id, code
) gm;
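Note that collect(proc1, proc2) aggregates each group into a map, so the outer query simply indexes it with the Proc1 values that are known up front ('p', 'q', 'r', 't'); if new Proc1 values show up, the matching columns have to be added to the outer select by hand (or the query generated dynamically), which is the limitation discussed in the comments below.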

The to_map UDF used is from the brickhouse repo:

Below is the approach I used to solve this problem with Hive's internal UDF function 'map':

'concat_ws' and 'map' are Hive UDFs and 'collect_list' is a Hive UDAF.
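A sketch matching that description - using only map, collect_list and concat_ws against the question's test_sample table - could look like this (an assumption, not necessarily the exact original query):

select b.id, b.code,
       concat_ws('', b.p) as p,
       concat_ws('', b.q) as q,
       concat_ws('', b.r) as r,
       concat_ws('', b.t) as t
from (
    select id, code,
           -- group_map['p'] is proc2 when proc1 = 'p', otherwise NULL;
           -- collecting and concatenating the group leaves just the matching value (or '')
           collect_list(a.group_map['p']) as p,
           collect_list(a.group_map['q']) as q,
           collect_list(a.group_map['r']) as r,
           collect_list(a.group_map['t']) as t
    from (
        select id, code, map(proc1, proc2) as group_map   -- one-entry map per row
        from test_sample
    ) a
    group by id, code
) b;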

Another solution.

Pivot using the map function:

SELECT
  uid,
  kv['c1'] AS c1,
  kv['c2'] AS c2,
  kv['c3'] AS c3
FROM (
  SELECT uid, to_map(key, value) kv
  FROM vtable
  GROUP BY uid
) t

uid  c1  c2  c3
101  11  12  13
102  21  22  23

Unpivot

SELECT t1.uid, t2.key, t2.value
FROM htable t1
LATERAL VIEW explode (map(
  'c1', c1,
  'c2', c2,
  'c3', c3
)) t2 as key, value

uid  key  value
101 c1 11
101 c2 12
101 c3 13
102 c1 21
102 c2 22
102 c3 23

If the values are numeric, you can use the following Hive query:

Sample data:

ID  cust_freq   Var1    Var2    frequency
220444  1   16443   87128   72.10140547
312554  6   984 7339    0.342452643
220444  3   6201    87128   9.258396518
220444  6   47779   87128   2.831972441
312554  1   6055    7339    82.15209213
312554  3   12868   7339    4.478333954
220444  2   6705    87128   15.80822558
312554  2   37432   7339    13.02712127

select id,
       sum(a.group_map[1]) as One,
       sum(a.group_map[2]) as Two,
       sum(a.group_map[3]) as Three,
       sum(a.group_map[6]) as Six
from (
    select id,
           map(cust_freq, frequency) as group_map
    from table
) a
group by a.id
having id in ('220444', '312554');

ID  one two three   six
220444  72.10140547 15.80822558 9.258396518 2.831972441
312554  82.15209213 13.02712127 4.478333954 0.342452643

In the above example I haven't used any custom UDF; it only uses in-built Hive functions.
Note: if the key is a string value, write it as sum(a.group_map['1']) as One.

Below is also a way to do the pivot:

SELECT TM1_Code, Product, Size, State_code, Description
  , Promo_date
  , Price
FROM (
SELECT TM1_Code, Product, Size, State_code, Description
   , MAP('FY2018Jan', FY2018Jan, 'FY2018Feb', FY2018Feb, 'FY2018Mar', FY2018Mar, 'FY2018Apr', FY2018Apr
        ,'FY2018May', FY2018May, 'FY2018Jun', FY2018Jun, 'FY2018Jul', FY2018Jul, 'FY2018Aug', FY2018Aug
        ,'FY2018Sep', FY2018Sep, 'FY2018Oct', FY2018Oct, 'FY2018Nov', FY2018Nov, 'FY2018Dec', FY2018Dec) AS tmp_column
FROM CS_ME_Spirits_30012018) TmpTbl
LATERAL VIEW EXPLODE(tmp_column) exptbl AS Promo_date, Price;

For the unpivot, we can simply use the logic below:

SELECT Cost.Code, Cost.Product, Cost.Size
, Cost.State_code, Cost.Promo_date, Cost.Cost, Sales.Price
FROM
(Select Code, Product, Size, State_code, Promo_date, Price as Cost
FROM Product
Where Description = 'Cost') Cost
JOIN
(Select Code, Product, Size, State_code, Promo_date, Price as Price
FROM Product
Where Description = 'Sales') Sales
on (Cost.Code = Sales.Code
and Cost.Promo_date = Sales.Promo_date);

You can achieve this using case statements and some help from collect_set. You can check it out. You can see the detailed answer at -

Here is the query for reference:

SELECT resource_id,
CASE WHEN COLLECT_SET(quarter_1)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_1)[0] END AS quarter_1_spends,
CASE WHEN COLLECT_SET(quarter_2)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_2)[0] END AS quarter_2_spends,
CASE WHEN COLLECT_SET(quarter_3)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_3)[0] END AS quarter_3_spends,
CASE WHEN COLLECT_SET(quarter_4)[0] IS NULL THEN 0 ELSE COLLECT_SET(quarter_4)[0] END AS quarter_4_spends
FROM (
SELECT resource_id,
CASE WHEN quarter='Q1' THEN amount END AS quarter_1,
CASE WHEN quarter='Q2' THEN amount END AS quarter_2,
CASE WHEN quarter='Q3' THEN amount END AS quarter_3,
CASE WHEN quarter='Q4' THEN amount END AS quarter_4
FROM billing_info)tbl1
GROUP BY resource_id;
  • I created a dummy table called hive using the below query -
  • create table hive (id Int, code String, Proc1 String, Proc2 String);

  • Loaded all of the data into the table -
  • Now used the below query to achieve the output -

  • Thanks for the input. I don't need the collect UDAF, since it is the same as the map-aggregation UDAF I have already used here. I could also have got there by using the key names from the map aggregation as new columns; the real problem is that I want this to be dynamic - i.e. I don't know how many distinct 'Proc1' values I might eventually end up with, and I want more columns created dynamically for each new 'Proc1'.
  • I don't want the to_map UDF from the brickhouse repo. Can you provide more details on this?
  • Here is where I see that you can use the 'collect' UDAF (it is similar); you should replace it with 'collect'. I have updated the solution with the same approach.
  • Hi! I am trying to do something similar. In your answer you have group_map['p'] etc., which suggests you know the values in advance. How do you get around not knowing what the values in Proc1 will be? Please share. Thanks.
  • Could you explain what you would do when the values are in numeric format? I saw your blog, but the code does not match the table.
  • Can this example be generalised to the case where more than one column needs attention? Can it be generalised to values other than 'p', 'q', 'r' and 't'? It would be helpful if you could explain your answer.
    
    insert into hive values('1','A','p','e');
    insert into hive values('2','B','q','f'); 
    insert into hive values('3','B','p','f');
    insert into hive values('3','B','q','h');
    insert into hive values('3','B','r','j');
    insert into hive values('3','C','t','k');
    
    select id,code,
         case when collect_list(p)[0] is null then '' else collect_list(p)[0] end as p,
         case when collect_list(q)[0] is null then '' else collect_list(q)[0] end as q,
         case when collect_list(r)[0] is null then '' else collect_list(r)[0] end as r,
         case when collect_list(t)[0] is null then '' else collect_list(t)[0] end as t
         from(
                select id, code,
                case when proc1 ='p' then proc2 end as p,
                case when proc1 ='q' then proc2 end as q,
                case when proc1 ='r' then proc2 end as r,
                case when proc1 ='t' then proc2 end as t
                from hive
            ) dummy group by id,code;