Hadoop Pig self-join performance

Tags: join, hadoop, apache-pig

I have a dataset of users and elements, and I want to find every pair of users that have at least one element in common. My data is structured like this:

id    element
--------------
1     a    
1     b
1     b
2     b
3     a
4     c
In this example, I would generate the following tuples:

(1,2) // both have element "b" in common
(1,3) // both have element "a" in common
I have written the following Pig script, which works at small scale, but when I ran it over one million rows (~500 MB) I killed the job after 1.5 hours because it had already generated almost 40 GB of data, which seems wildly out of proportion to what I am trying to accomplish. I am new to Pig, so I am hoping this can be optimized a bit. Any help would be greatly appreciated.

-- load the data
mydata = LOAD '/path/to/my/data' USING PigStorage('\t') AS (user:int, element:chararray);
-- generate a copy to do a self join with
A = FOREACH mydata GENERATE user as user_2, element as element_2;
-- join them based on common tags
B = JOIN mydata BY element, A by element_2;
-- we only want the mapping in one direction, e.g. (1,2) is the same as (2,1)
C = FILTER B BY user < user_2;
-- we're only interested in the user ids
D = FOREACH C generate user, user_2;
-- remove any duplicate tuples
E = DISTINCT D;
STORE E INTO '/path/to/output';

Note: this is a follow-up to my previous question, with a slightly different approach.

If your input contains duplicates, it is best to filter them out first, since they lead to a combinatorial explosion.
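
For example, a minimal sketch of that idea, reusing the aliases and placeholder paths from the script above: apply DISTINCT to the (user, element) rows before the self join, so a user who lists the same element several times contributes only one row per side of the join.

-- load the data
mydata = LOAD '/path/to/my/data' USING PigStorage('\t') AS (user:int, element:chararray);
-- drop duplicate (user, element) rows so they cannot multiply in the join
mydata_dedup = DISTINCT mydata;
-- the rest of the original script then joins mydata_dedup instead of mydata
A = FOREACH mydata_dedup GENERATE user AS user_2, element AS element_2;
B = JOIN mydata_dedup BY element, A BY element_2;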

Another thing you can try is grouping instead of joining. You get the result right away, just not as a list of pairs:

mydata = LOAD '/path/to/data.tsv' USING PigStorage('\t') AS (user:int, element:chararray);
A = GROUP mydata by element;
B = foreach A generate (group, mydata.user);
ILLUSTRATE B;
which then gives:

---------------------------------------------------
| mydata     | user:int    | element:chararray    | 
---------------------------------------------------
|            | 1           | a                    | 
|            | 3           | a                    | 
---------------------------------------------------
---------------------------------------------------------------------------------------------
| A     | group:chararray    | mydata:bag{:tuple(user:int,element:chararray)}               | 
---------------------------------------------------------------------------------------------
|       | a                  | {(1, a), (3, a)}                                             | 
---------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------
| B     | org.apache.pig.builtin.totuple_group_13:tuple(group:chararray,:bag{:tuple(user:int)})                     | 
---------------------------------------------------------------------------------------------------------------------
|       | (a, {(1), (3)})                                                                                           | 
---------------------------------------------------------------------------------------------------------------------
So in B you already have, per element, all the user ids that share that element.

To get the list of pairs, you would have to do something like:

C = foreach B {
    X = foreach $0 generate $0.$1;
    Y = foreach $0 generate $0.$1;
    F = CROSS X, Y ;
    generate $0.group, flatten(F);
};
But it does not work... I get:

org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POProject (Name: Project[bag][1] - scope-131 Operator Key: scope-131) children: null at []]: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:338)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCross.accumulateData(POCross.java:202)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCross.getNextTuple(POCross.java:116)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNextDataBag(PhysicalOperator.java:385)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:590)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNextDataBag(PORelationToExprProject.java:106)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:309)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
    at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:236)
    at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
    at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:238)
    at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:103)
    at org.apache.pig.pen.LineageTrimmingVisitor.<init>(LineageTrimmingVisitor.java:98)
    at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:166)
    at org.apache.pig.PigServer.getExamples(PigServer.java:1238)
    at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:831)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:802)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:381)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:541)
    at org.apache.pig.Main.main(Main.java:156)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNextTuple(POProject.java:476)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:592)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNextDataBag(POProject.java:247)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:309)
    ... 35 more
2014-03-20 01:28:57,235 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. ExecException
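
As a possible workaround sketch (my own, untested at scale; it relies only on Pig's documented behavior that FLATTENing two bags in the same GENERATE yields their cross product), the pair list can be built without the nested CROSS, and then filtered and de-duplicated exactly as in the original script:

grouped = GROUP mydata BY element;
-- flattening the same bag twice expands to the cross product of user ids within each element group
pairs = FOREACH grouped GENERATE FLATTEN(mydata.user) AS user, FLATTEN(mydata.user) AS user_2;
-- keep each pair once, in one direction only
C = FILTER pairs BY user < user_2;
E = DISTINCT C;
STORE E INTO '/path/to/output';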