Join: How to generate a certain number of tuples in Pig?

I have the following datasets:

A:

x1 y z1
x2 y z2
x3 y z3
x43 y z33
x4 y2 z4
x5 y2 z5
x6 y2 z6
x7 y2 z7

B:

y 12
y2 25

A = LOAD '$input' USING PigStorage() AS (k:chararray, m:chararray, n:chararray);
B = LOAD '$input2' USING PigStorage() AS (o:chararray, p:int);

I join A on m with B on o. What I want is to keep only x tuples for each o. For example, if x is 2, the result is:

x1 y z1
x2 y z2
x4 y2 z4
x5 y2 z5

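To do this, you need GROUP BY, a FOREACH with a nested LIMIT, and then a JOIN or COGROUP. Here is an implementation in Pig 0.10; I used your input data and got the specified output: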

A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray);
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int);
-- the join is on m, so keep only 2 rows per value of m
group_A = group A by m;
top_A_x = foreach group_A {
    top = limit A 2; -- where x = 2
    generate flatten(top);
};

-- COGROUP as an alternative to JOIN; it also makes left/right-join style checks possible
co_join = cogroup top_A_x by (m), B by (o);
-- drop the groups whose key in A has no match in B
filter_join = filter co_join by IsEmpty(B) == false;
result = foreach filter_join generate flatten(top_A_x);

Alternatively, you can do it with a single COGROUP and a FOREACH with a nested LIMIT:

A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray);
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int);

co_join = cogroup A by (m), B by (o);
filter_join = filter co_join by IsEmpty(B) == false;
result = foreach filter_join {
    top = limit A 2;
    -- you can limit B as well
    generate flatten(top);
};
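
One caveat with both versions: a nested LIMIT with no ordering keeps an arbitrary 2 tuples per group, so which rows survive is not guaranteed. If that matters, a minimal sketch is to add a nested ORDER BY before the LIMIT. Sorting on k is my assumption here; the question does not say which x rows should be kept.

```pig
A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray);
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int);

co_join = cogroup A by (m), B by (o);
-- keep only the keys that also appear in B
filter_join = filter co_join by not IsEmpty(B);
result = foreach filter_join {
    sorted = order A by k;  -- assumption: rank rows within each group by k
    top = limit sorted 2;   -- where x = 2
    generate flatten(top);
};
```

Note that k is a chararray, so the ordering is lexicographic (x43 sorts before x5). With the sample data this should still keep x1 and x2 for y, and x4 and x5 for y2.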
