Apache Pig: getting all tuples out of a grouped bag


I am using Pig to generate groups from tuples, as follows:

a1, b1
a1, b2
a1, b3
...

->

a1, [b1, b2, b3]
...
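This is easy and works well. My problem is getting the following result: from the resulting groups, I would like to generate the set of all pairs of tuples in each group's bag: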

a1, [b1, b2, b3]

->

b1,b2
b1,b3
b2,b3
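This would be easy if I could nest "foreach" and iterate first over each group and then over its bag.

I suppose I am misunderstanding the concept, and I would appreciate an explanation.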


Thanks.

It looks like you need a Cartesian product of the bag with itself. To do this, you need to use FLATTEN(bag) twice.

Code:

inpt = load '.../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as value_bag;
result = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2; 
dump result;
Note that large bags will produce a lot of rows. To avoid this, you can use TOP(...) before flattening:

inpt = load '....group.txt' using PigStorage(',')  as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    limited_bag = TOP(50, 0, values); -- all sorts of filtering could be done here
    generate id, FLATTEN(limited_bag) as v1, FLATTEN(limited_bag) as v2; 
};
dump result;
For your specific output, you could apply some filtering before flattening:

inpt = load '..../group.txt' as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    l = filter values by val == 'b1' or val == 'b2';
    generate id, FLATTEN(l) as v1, FLATTEN(values) as v2; 
};
result = filter result by v1 != v2;
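
Filtering on v1 != v2 still keeps both (b1,b2) and (b2,b1). If only unordered pairs are wanted, as in the question, one possible variation is to compare the two flattened values with < instead. The following is a minimal sketch rather than a tested solution; it assumes both fields are loaded as chararray (the type declarations are an addition, not part of the code above) and that the values within a bag are distinct:

```pig
-- Assumption: two comma-separated fields per line, both treated as chararray.
inpt   = load '.../group.txt' using PigStorage(',') as (id:chararray, val:chararray);
grp    = group inpt by id;
-- Two FLATTENs of the same bag give the Cartesian product of the bag with itself.
pairs  = foreach grp generate group as id, FLATTEN(inpt.val) as v1, FLATTEN(inpt.val) as v2;
-- Keeping only v1 < v2 drops self-pairs and mirrored duplicates.
result = filter pairs by v1 < v2;
dump result;
```

For the group a1 with values b1, b2, b3 this should leave the pairs (b1,b2), (b1,b3) and (b2,b3), each prefixed with the id. The Cartesian product is still fully materialized before the filter runs, so the warning about large bags applies here too.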
I hope this helps.


Cheers

This function from the UDF library is also relevant here. It generates all pairs of items in a bag (in your case, your grouped bag).

You can generate this with Pig's GROUP ALL statement:

A  = -- Some bag
B  = -- Another bag

groupedB = group B ALL;
result   = foreach A GENERATE 
    TOTUPLE(*), groupedB.$1;

-- Will generate
((a1), {(b1, b2, b3)})
((a2), {(b1, b2, b3)})
((a3), {(b1, b2, b3)})
...

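Lawrence is right. This UDF is exactly what you need, and it is also more efficient than the pure Pig solution that uses a Cartesian product. By the way, the URL has changed: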