Hadoop: Getting the average of every N tuples in Apache Pig
Suppose I have a table with two columns, CUSTTYPE and AMOUNT. I want to add a third column, NTILE, then group by it and use it to get averages, like this:
CUSTTYPE | AMOUNT | NTILE
----------+---------+----------
RETAIL | 78.00 | 1
RETAIL | 234.00 | 1
RETAIL | 249.00 | 1
RETAIL | 278.00 | 2
RETAIL | 392.00 | 2
RETAIL | 498.00 | 2
RETAIL | 500.00 | 3
RETAIL | 738.00 | 3
RETAIL | 1250.00 | 3
RETAIL | 2029.00 | 4
RETAIL | 2393.00 | 4
RETAIL | 3933.00 | 4
Essentially, I am trying to take the average of every n items (here, n = 3).
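From the Pig reference, it looks like Over() can do this, but I couldn't find an example of how to use it. Any ideas?

You can use the RANK operator to assign a rank to each record of the data, like this: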
A = LOAD 'path' AS (schema);
B = RANK A;
Then divide each rank by 3 (integer division) to get the bucket number:
-- RANK numbers records starting at 1, so add 2 before the integer division to get buckets 1, 1, 1, 2, 2, 2, ...
C = FOREACH B GENERATE ($0 + 2) / 3 AS NTILE, CUSTTYPE, AMOUNT;
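
The answer stops at the bucket number, so here is a minimal sketch of the whole pipeline, including the grouping and averaging step the question asks for. The comma delimiter, the field types, and the alias names D, E and AVG_AMOUNT are assumptions for illustration, and 'path' is the placeholder from the snippet above. It also assumes the amounts are distinct: RANK ... BY gives tied values the same rank, which would make some buckets uneven.

```pig
-- Load the two-column table (delimiter and types are assumed).
A = LOAD 'path' USING PigStorage(',') AS (CUSTTYPE:chararray, AMOUNT:double);

-- Rank by AMOUNT so buckets follow ascending amounts; the rank is prepended as field $0.
B = RANK A BY AMOUNT ASC;

-- Ranks start at 1; (rank + 2) / 3 with integer division maps ranks 1-3 to 1, 4-6 to 2, and so on.
C = FOREACH B GENERATE ($0 + 2) / 3 AS NTILE, CUSTTYPE, AMOUNT;

-- Group the rows of each bucket and average their amounts.
D = GROUP C BY NTILE;
E = FOREACH D GENERATE group AS NTILE, AVG(C.AMOUNT) AS AVG_AMOUNT;

DUMP E;
```

For the twelve sample rows in the question, the arithmetic gives bucket averages of 187.00, about 389.33, about 829.33, and 2785.00. For a bucket size other than 3, the same expression generalizes to ($0 + n - 1) / n.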