Apache Pig: how do I write this Pig query?


I have a many-to-many mapping table between two sets. Each row in the mapping table represents a possible mapping along with a weight score:

mapping(id1, id2, weight)
Query: produce a one-to-one mapping between id1 and id2. Remove duplicate mappings by keeping the one with the minimum weight. If there is a tie, output any one of them.

Example input:

(1, X, 1)
(1, Y, 2)
(2, X, 3)
(2, Y, 1)
(3, Z, 2)
Expected output:

(1, X)
(2, Y)
(3, Z)

Both 1 and 2 map to X and Y. We choose the mappings (1, X) and (2, Y) because they have the lowest weights.

I solved this with a Java UDF. In a sense it isn't perfect, because it doesn't maximize the number of one-to-one mappings, but it is good enough.

Pig:
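The Pig script itself did not survive in the post. Judging from the UDF below and the later remark that this approach needs only two reduce stages, it was presumably something along these lines (the jar name, the DEFINE alias, and the output field names are all assumptions):

```pig
-- Sketch only: register the UDF jar (path assumed) and apply DEDUP
-- once per id1 and once per id2, for two reduce stages in total.
REGISTER propeld-pig.jar;
DEFINE DEDUP com.propeld.pig.DEDUP();

mapping = LOAD 'input' AS (id1:chararray, id2:chararray, weight:double);

-- keep the minimum-weight mapping for each id1
by_id1 = FOREACH (GROUP mapping BY id1)
         GENERATE FLATTEN(DEDUP(mapping)) AS (id1:chararray, id2:chararray, weight:double);

-- of those, keep the minimum-weight mapping for each id2
by_id2 = FOREACH (GROUP by_id1 BY id2)
         GENERATE FLATTEN(DEDUP(by_id1)) AS (id1:chararray, id2:chararray, weight:double);

result = FOREACH by_id2 GENERATE id1, id2;
DUMP result;
```

On the sample input, the first pass keeps (1, X, 1), (2, Y, 1), and (3, Z, 2); the second pass finds no remaining conflicts on id2, giving the expected output.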

Java UDF:

package com.propeld.pig;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class DEDUP extends EvalFunc<Tuple> implements Algebraic{
    public String getInitial() {return Initial.class.getName();}
    public String getIntermed() {return Intermed.class.getName();}
    public String getFinal() {return Final.class.getName();}
    static public class Initial extends EvalFunc<Tuple> {
        private static TupleFactory tfact = TupleFactory.getInstance();
        public Tuple exec(Tuple input) throws IOException {
            // Initial is called in the map.
            // we just send the tuple down
            try {
                // input is a bag with one tuple containing
                // the column we are trying to operate on
                DataBag bg = (DataBag) input.get(0);
                if (bg.iterator().hasNext()) {
                    Tuple dba = (Tuple) bg.iterator().next();
                    return dba;
                } else {
                    // make sure that we call the object constructor, not the list constructor
                    return tfact.newTuple((Object) null);
                }
            } catch (ExecException e) {
                throw e;
            } catch (Exception e) {
                int errCode = 2106;
                throw new ExecException("Error executing an algebraic function", errCode, PigException.BUG, e);
            }
        }
    }
    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return dedup(input);
        }
    }
    static public class Final extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {return dedup(input);}
    }

    static protected Tuple dedup(Tuple input) throws ExecException, NumberFormatException {
        DataBag values = (DataBag)input.get(0);
        Double min = Double.MAX_VALUE;
        Tuple result = null;
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = (Tuple) it.next();

            if ((Double)t.get(2) < min){
                min = (Double)t.get(2);
                result = t;
            }
        }
        return result;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        return dedup(input);
    }
}
I'm going to assume that you are only interested in a mapping if its weight is the lowest among all mappings involving its id1 and also the lowest among all mappings involving its id2. For example, if you additionally had the mapping (2, Y, 4), it would not conflict with (1, X, 1). I am going to exclude mappings like this, since smaller weights appear in (1, Y, 2) and (2, X, 3), and those two are themselves disqualified.

My solution goes like this: find the minimum mapping weight for each id1, then join it back onto the mapping relation for later reference. Then inspect each id2 with a nested FOREACH: use ORDER and LIMIT to select the minimum-weight record for that id2, and keep it only if that weight is also the minimum weight for its id1.

Here is the complete script, tested on your input:

mapping = LOAD 'input' AS (id1:chararray, id2:chararray, weight:double);

id1_weights =
    FOREACH (GROUP mapping BY id1)
    GENERATE group AS id1, MIN(mapping.weight) AS id1_min_weight;
mapping_with_id1_mins =
    FOREACH (JOIN mapping BY id1, id1_weights BY id1)
    GENERATE mapping::id1, id2, weight, id1_min_weight;

accepted_mappings =
    FOREACH (GROUP mapping_with_id1_mins BY id2)
    {
        ordered = ORDER mapping_with_id1_mins BY weight;
        selected = LIMIT ordered 1;
        acceptable = FILTER selected BY weight == id1_min_weight;
        GENERATE FLATTEN(acceptable);
    };

DUMP accepted_mappings;

This looks great! What do you think of my UDF solution below? It avoids the join, but writing a Java UDF is more involved than writing Pig. The UDF solution is probably faster, since it appears to need only two reduce stages instead of three. Still, I like to use pure Pig whenever possible, purely as a matter of personal preference.

Also, the answer to the comment on your question has a bearing on whether the UDF solution is acceptable: what if your input also included the mapping (3, Y, 1.5)? Should you still output (3, Z)?

That's an interesting point. In practice, most duplicate mappings are one-to-many or many-to-one; true many-to-many duplicates are rare, so I'm happy with either approach. The goal is to find the most accurate one-to-one mapping.