Hadoop: combining UNION and JOIN in Apache Pig
I have two files in HDFS that contain the following data.

File 1:
id,name,age
1,x1,15
2,x2,14
3,x3,16
File 2:
id,name,grades
1,x1,A
2,x2,B
4,y1,A
5,y2,C
I want to produce the following output:
id,name,age,grades
1,x1,15,A
2,x2,14,B
3,x3,16,
4,y1,,A
5,y2,,C
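I am using Apache Pig for this. Is it possible to get the above output in Pig? It is effectively a union and a join of the two files at the same time.

Since Pig supports both UNION and JOIN, this is certainly possible. Without going into the exact syntax, I can tell you that it should be doable (I have used a similar solution in the past).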
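-- Load both files; the schemas follow the headers shown in the question.
A = load 'pdemo/File1' using PigStorage(',') as (id:int, name:chararray, age:chararray);
B = load 'pdemo/File2' using PigStorage(',') as (id:int, name:chararray, grades:chararray);
-- The left join keeps every row of A; the right join keeps every row of B.
lj = join A by id left outer, B by id;
rj = join A by id right outer, B by id;
-- Project both results onto the same (id, name, age, grades) layout.
lj1 = foreach lj generate A::id as id, A::name as name, A::age as age, B::grades as grades;
rj1 = foreach rj generate B::id as id, B::name as name, A::age as age, B::grades as grades;
-- Rows matched in both files appear in both joins, so remove the duplicates.
res = union lj1, rj1;
FinalResult = distinct res;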
In terms of performance, the second approach is better:
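-- Collect the distinct (id, name) pairs that occur in either file.
A1 = foreach A generate id, name;
B1 = foreach B generate id, name;
M2 = union A1, B1;
M2 = distinct M2;
-- Attach age from A, then grades from B; an id missing from a file yields null there.
M2A = JOIN M2 by id left outer, A by id;
M2AB = JOIN M2A by M2::id left outer, B by id;
Res = foreach M2AB generate M2A::M2::id as id, M2A::M2::name as name, M2A::A::age as age, B::grades as grades;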
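Hope this helps.

I think this works, but be aware that running DISTINCT at the end means the memory requirements may be much higher than with the solution I suggested.

Essentially you have performed a hidden full outer join, so I think you would be better off using a full outer join instead of the left and right joins.

This is helpful. I am going to try to automate this for any two files (with a variable set of columns and a variable set of common columns). If you have any ideas, please suggest them.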
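u1 = load 'PigDir/u1' using PigStorage(',') as (id:int, name:chararray, age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int, name:chararray, grades:chararray);
-- A full outer join keeps unmatched rows from both sides and fills the gaps with nulls.
uj = join u2 by id full outer, u1 by id;
-- Fields of uj: $0-$2 are u2's id, name, grades; $3-$5 are u1's id, name, age.
-- Take id and name from whichever side is not null.
uif = foreach uj generate ($0 is null ? $3 : $0) as id, ($1 is null ? $4 : $1) as name, $5 as age, $2 as grades;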
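
The positional references ($0 to $5) in the full outer join depend on the order in which the two relations are joined, so they break silently if that order changes. A minimal sketch of the same join with named fields might look like the following. It assumes the input files contain no header row, and the output path 'PigDir/joined' is only a placeholder, not something from the question.

```pig
-- Same inputs as in the answer above (assumed to have no header line).
u1 = load 'PigDir/u1' using PigStorage(',') as (id:int, name:chararray, age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int, name:chararray, grades:chararray);

uj = join u2 by id full outer, u1 by id;

-- Print the joined schema to check the field names (u2::id, u1::id, ...).
describe uj;

-- Refer to fields by name; the :: prefix says which input a field came from.
uif_named = foreach uj generate
    (u2::id is null ? u1::id : u2::id) as id,
    (u2::name is null ? u1::name : u2::name) as name,
    u1::age as age,
    u2::grades as grades;

-- Sort by id for readability and write the result out.
-- 'PigDir/joined' is a hypothetical output path.
sorted = order uif_named by id;
store sorted into 'PigDir/joined' using PigStorage(',');
```

PigStorage writes null fields as empty strings, so rows that exist in only one file should come out with an empty age or grades column, as in the output requested in the question.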