Apache Pig: How do I "regroup" Pig relations?


apache-pig


Suppose I have an input file, input.dat, that looks like this:

apples 10
oranges 30
apples 6
pears 5

Now, when I load, group, and project the data:

sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;
projection = FOREACH grouped GENERATE FLATTEN(group), SUM(sources.b);
DUMP projection;

I get the following:

apples 16
oranges 30
pears 5

Now I would like to "regroup" the rows whose SUM(sources.b) is below some threshold into a single row. For example, with a threshold of 20, I would get:

other 21
oranges 30

because the sums for "apples" (16) and "pears" (5) are both below 20.

As I see it, there are two different approaches I could follow:

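Option 1: Use the SPLIT operator on grouped to create two relations, above_threshold and below_threshold. Then project below_threshold to replace the key value with "other" and regroup. Finally, UNION the result with above_threshold and run the final projection again.

Option 2: Follow the original script exactly, but when creating the projection, conditionally generate a (based on SUM(sources.b)), then regroup the projection (grouping all the "other" rows together), and then project again (flattening the regrouped data).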
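
The second option could be sketched roughly as follows. This is only an illustration, not a tested script: it assumes input.dat is tab-separated (PigStorage's default delimiter), it hard-codes the threshold of 20 from the example, and the relation names sums, relabeled, regrouped, and result are made up for this sketch.

```pig
-- Assumes tab-separated fields, which is what the default PigStorage loader expects.
sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;
sums = FOREACH grouped GENERATE group AS n, SUM(sources.b) AS s;

-- Conditionally swap the key for 'other' when the sum is below the threshold.
relabeled = FOREACH sums GENERATE (s < 20 ? 'other' : n) AS n, s;

-- Regroup every row, so all the 'other' rows collapse into a single group.
regrouped = GROUP relabeled BY n;
result = FOREACH regrouped GENERATE group AS n, SUM(relabeled.s) AS s;
DUMP result;
```

Note that the second GROUP here runs over every row of sums, not only the rows below the threshold.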

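Is one of these approaches clearly better than the other? Or is there another way that is more efficient or easier to maintain?

Option 1 is better. With option 1, only the below_threshold data has to go through the second M/R pass; with option 2, it looks like you are regrouping everything.

Approach 1 also has some other advantages, most notably:

The below_threshold aggregation will be very fast, because you only need one reducer, and the combiner works wonders with only one key.

Depending on your application, you may not need the UNION. You can just write to two output locations and then "union" them by treating them as the same external output from Pig. For example, you can still run hadoop fs -getmerge my_out/*/part-r-* output to get both outputs.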

So, I see your Pig script looking like:
sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;
-- Iterate over grouped (not sources) so that group and the bag sources.b exist.
projection = FOREACH grouped GENERATE FLATTEN(group) AS n, SUM(sources.b) AS s;
SPLIT projection INTO above_threshold IF s >= 20, below_threshold IF s < 20;
DUMP above_threshold;

-- A constant key puts every below-threshold row into one group on one reducer.
below_grouped = GROUP below_threshold BY 'other' PARALLEL 1;
below_projection = FOREACH below_grouped GENERATE group, SUM(below_threshold.s);
DUMP below_projection;
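
If a single relation is needed after all, the two branches could be recombined with UNION, as the question proposes. The following is a minimal sketch under the same assumptions as the script above (tab-separated input.dat, threshold of 20); the relation names below_sum and combined are invented for illustration.

```pig
sources = LOAD 'input.dat' AS (a:chararray, b:int);
grouped = GROUP sources BY a;
projection = FOREACH grouped GENERATE group AS n, SUM(sources.b) AS s;
SPLIT projection INTO above_threshold IF s >= 20, below_threshold IF s < 20;

-- Collapse every below-threshold row under the constant key 'other'.
below_grouped = GROUP below_threshold BY 'other' PARALLEL 1;
below_sum = FOREACH below_grouped GENERATE group AS n, SUM(below_threshold.s) AS s;

-- Both branches now carry the same two fields (n, s), so they can be unioned.
combined = UNION above_threshold, below_sum;
DUMP combined;
```

For the sample data this should produce an oranges row with 30 and an other row with 21. UNION does not guarantee row order, so add an ORDER BY if the order matters.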
