Hadoop Pig: how to flatten & re-join bags into bags
I have an example where we are trying to do what looks like a simple join:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
grunt> cat data1
'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
grunt> cat data2
'value1' 'result1'
'value2' 'result2'
We want to join the 'result1', 'result2' data from data2 into the entries of the values field in data1.
We managed to flatten it:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
F1 = foreach A generate item, d, flatten(things);
F2 = foreach F1 generate item..d1, flatten(values);
Then we joined the second dataset:
J = join F2 by v, B by v;
J1 = foreach J generate item as item, d as d, thing as thing, d1 as d1, F2::things::values::v as v, r as r; --Remove duplicate field & clean up naming
dump J1;
('item1',111,'thing1',222,'value1','result1')
('item1',111,'thing1',222,'value2','result2')
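Now we need to call a UDF once per item, so we need to regroup both levels of bags. Each item has 0 or more things, each thing has 0 or more values, and each of those values may or may not now have a result.
How do we get back to: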
('item1', 111, { ('thing1', 222, { ('value1', 'result1'), ('value2', 'result2') }) })
All of my attempts at grouping and re-joining became very complicated, failed to produce the correct result, and ran as 4 or more MapReduce jobs, when this should be a single MapReduce job in Hadoop.
The following code may work; R2 is the final result:
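-- First level: one group per (item, d, thing, d1); multiple group keys must be parenthesized
group_by_item_d_thing_d1 = group J1 by (item, d, thing, d1);
R1 = foreach group_by_item_d_thing_d1 generate group.item, group.d, group.thing, group.d1, J1;
-- Second level: one group per (item, d), nesting the per-thing rows from R1
group_by_item_d = group R1 by (item, d);
R2 = foreach group_by_item_d generate group.item, group.d, R1;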
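Because each inner bag in the regrouping above carries every field of its source rows, the nested tuples repeat the group keys. A possible variant, offered only as an untested sketch, projects each bag down to the fields wanted at that level. It assumes J1 has the schema (item, d, thing, d1, v, r) shown in the question; the aliases G1 and G2 and the bag name things are chosen here for illustration.

```pig
-- Sketch only: assumes J1 has the fields (item, d, thing, d1, v, r) as in the question.
-- G1, G2 and the bag alias `things` are illustrative names, not from the original post.

-- Level 1: collect the (v, r) pairs that belong to each thing.
G1 = group J1 by (item, d, thing, d1);
R1 = foreach G1 generate
         flatten(group) as (item, d, thing, d1),
         J1.(v, r) as values;

-- Level 2: collect the (thing, d1, values) tuples that belong to each item.
G2 = group R1 by (item, d);
R2 = foreach G2 generate
         flatten(group) as (item, d),
         R1.(thing, d1, values) as things;

-- Inspect the resulting nested schema before passing R2 to the UDF.
describe R2;
```

Two caveats apply to the "0 or more" cases in the question. An inner join drops values that have no match in B; `join F2 by v left outer, B by v;` would keep them with a null r. Separately, flatten on an empty bag emits no row, so items with no things, or things with no values, do not survive the flattening step and would need to be handled separately.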