Apache Pig: How do I write this Pig query?

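I have a many-to-many mapping table between two sets. Each row in the mapping table represents one possible mapping together with a weight score: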

mapping(id1, id2, weight)

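Query: generate a one-to-one mapping between id1 and id2. Remove duplicate mappings by keeping the one with the lowest weight. If there is a tie, output any one of the tied mappings.

Sample input: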

(1, X, 1)
(1, Y, 2)
(2, X, 3)
(2, Y, 1)
(3, Z, 2)

Output:

(1, X)
(2, Y)
(3, Z)

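Both 1 and 2 map to both X and Y. We pick the mappings (1, X) and (2, Y) because they have the lowest weights.

I solved this problem with a Java UDF. It is not perfect, in the sense that it does not maximize the number of one-to-one mappings, but it is good enough.

Pig:

Java UDF: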

package com.propeld.pig;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class DEDUP extends EvalFunc<Tuple> implements Algebraic{
    public String getInitial() {return Initial.class.getName();}
    public String getIntermed() {return Intermed.class.getName();}
    public String getFinal() {return Final.class.getName();}
    static public class Initial extends EvalFunc<Tuple> {
        private static TupleFactory tfact = TupleFactory.getInstance();
        public Tuple exec(Tuple input) throws IOException {
            // Initial is called in the map.
            // we just send the tuple down
            try {
                // input is a bag with one tuple containing
                // the column we are trying to operate on
                DataBag bg = (DataBag) input.get(0);
                if (bg.iterator().hasNext()) {
                    Tuple dba = (Tuple) bg.iterator().next();
                    return dba;
                } else {
                    // make sure that we call the object constructor, not the list constructor
                    return tfact.newTuple((Object) null);
                }
            } catch (ExecException e) {
                throw e;
            } catch (Exception e) {
                int errCode = 2106;
                throw new ExecException("Error executing an algebraic function", errCode, PigException.BUG, e);
            }
        }
    }
    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return dedup(input);
        }
    }
    static public class Final extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {return dedup(input);}
    }

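    // Returns the tuple with the smallest weight (field 2) in the input bag,
    // or null if the bag holds no tuple with a usable weight.
    static protected Tuple dedup(Tuple input) throws ExecException {
        DataBag values = (DataBag) input.get(0);
        double min = Double.MAX_VALUE;
        Tuple result = null;
        for (Tuple t : values) {
            // Skip the one-field placeholder that Initial emits for an empty bag,
            // and any tuple whose weight is null.
            if (t == null || t.size() < 3 || t.get(2) == null) {
                continue;
            }
            double weight = (Double) t.get(2);
            if (weight < min) {
                min = weight;
                result = t;
            }
        }
        return result;
    }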

    @Override
    public Tuple exec(Tuple input) throws IOException {
        return dedup(input);
    }
}
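
To make the "does not maximize the number of one-to-one mappings" caveat concrete, here is a small plain-Java sketch that does not use Pig at all. It assumes the UDF is applied twice, first to the rows grouped by id1 and then to the survivors grouped by id2. That ordering is an assumption on my part, since the Pig script itself is not shown above. `TwoStageMin`, `Mapping`, `minPerKey` and `oneToOne` are made-up names for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from the Pig API.
class TwoStageMin {
    static final class Mapping {
        final String id1;
        final String id2;
        final double weight;

        Mapping(String id1, String id2, double weight) {
            this.id1 = id1;
            this.id2 = id2;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return "(" + id1 + ", " + id2 + ")";
        }
    }

    // For every key, keeps the first mapping with the strictly smallest weight.
    // This mirrors the strict "<" comparison in dedup for one grouped bag.
    static List<Mapping> minPerKey(List<Mapping> rows, boolean keyIsId1) {
        Map<String, Mapping> best = new LinkedHashMap<>();
        for (Mapping m : rows) {
            String key = keyIsId1 ? m.id1 : m.id2;
            Mapping current = best.get(key);
            if (current == null || m.weight < current.weight) {
                best.put(key, m);
            }
        }
        return new ArrayList<>(best.values());
    }

    // ASSUMPTION: group by id1 and take the minimum, then group the
    // survivors by id2 and take the minimum again.
    static List<Mapping> oneToOne(List<Mapping> rows) {
        return minPerKey(minPerKey(rows, true), false);
    }

    public static void main(String[] args) {
        List<Mapping> sample = Arrays.asList(
            new Mapping("1", "X", 1), new Mapping("1", "Y", 2),
            new Mapping("2", "X", 3), new Mapping("2", "Y", 1),
            new Mapping("3", "Z", 2));
        System.out.println(oneToOne(sample)); // [(1, X), (2, Y), (3, Z)]

        // Here (1, Y) and (2, X) would give two pairs, but only one survives.
        List<Mapping> lossy = Arrays.asList(
            new Mapping("1", "X", 1), new Mapping("1", "Y", 2),
            new Mapping("2", "X", 3));
        System.out.println(oneToOne(lossy)); // [(1, X)]
    }
}
```

On the sample input this reproduces the expected output. The second input shows the trade-off: id 2 is left unmapped even though a two-pair assignment exists, because each stage only looks at the minimum within its own group.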

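I'll assume that you are only interested in mappings whose weight is the lowest of all mappings involving that id1 and also the lowest of all mappings involving that id2. For example, if you additionally had the mapping (2, Y, 4), it would not conflict with (1, X, 1). I will still exclude mappings like this one: just like (1, Y, 2) and (2, X, 3), which are disqualified, its weight is not the minimum for its id1 or for its id2.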

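My solution is as follows: find the minimum mapping weight for each id1, then join it back onto the mapping relation for later reference. Then examine each id2 with a nested FOREACH: use ORDER and LIMIT to select the minimum-weight record for that id2, and keep that record only if its weight is also the minimum weight for its id1.

Here is the complete script, tested on your input: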

mapping = LOAD 'input' AS (id1:chararray, id2:chararray, weight:double);

id1_weights =
    FOREACH (GROUP mapping BY id1)
    GENERATE group AS id1, MIN(mapping.weight) AS id1_min_weight;
mapping_with_id1_mins =
    FOREACH (JOIN mapping BY id1, id1_weights BY id1)
    GENERATE mapping::id1, id2, weight, id1_min_weight;

accepted_mappings =
    FOREACH (GROUP mapping_with_id1_mins BY id2)
    {
        ordered = ORDER mapping_with_id1_mins BY weight;
        selected = LIMIT ordered 1;
        acceptable = FILTER selected BY weight == id1_min_weight;
        GENERATE FLATTEN(acceptable);
    };

DUMP accepted_mappings;
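
For anyone who wants to check the logic without a Pig installation, the three steps of the script can be sketched in plain Java. This is only an illustration of the same idea, not a translation that Pig produces. `MutualMin`, `Mapping` and `accepted` are hypothetical names, and ties are broken by input order here, whereas Pig's LIMIT picks an arbitrary record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from the Pig API.
class MutualMin {
    static final class Mapping {
        final String id1;
        final String id2;
        final double weight;

        Mapping(String id1, String id2, double weight) {
            this.id1 = id1;
            this.id2 = id2;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return "(" + id1 + ", " + id2 + ")";
        }
    }

    static List<Mapping> accepted(List<Mapping> rows) {
        // Step 1: minimum weight per id1 (the id1_weights relation).
        Map<String, Double> id1Min = new HashMap<>();
        for (Mapping m : rows) {
            id1Min.merge(m.id1, m.weight, Math::min);
        }

        // Step 2: lowest-weight record per id2 (ORDER + LIMIT 1 in the nested FOREACH).
        Map<String, Mapping> id2Best = new LinkedHashMap<>();
        for (Mapping m : rows) {
            Mapping current = id2Best.get(m.id2);
            if (current == null || m.weight < current.weight) {
                id2Best.put(m.id2, m);
            }
        }

        // Step 3: keep a record only if its weight is also its id1's minimum (the FILTER).
        List<Mapping> out = new ArrayList<>();
        for (Mapping m : id2Best.values()) {
            double minForId1 = id1Min.get(m.id1);
            if (m.weight == minForId1) {
                out.add(m);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Mapping> sample = new ArrayList<>(Arrays.asList(
            new Mapping("1", "X", 1), new Mapping("1", "Y", 2),
            new Mapping("2", "X", 3), new Mapping("2", "Y", 1),
            new Mapping("3", "Z", 2)));
        System.out.println(accepted(sample)); // [(1, X), (2, Y), (3, Z)]

        // The edge case raised in the comments: add the mapping (3, Y, 1.5).
        sample.add(new Mapping("3", "Y", 1.5));
        System.out.println(accepted(sample)); // [(1, X), (2, Y)]
    }
}
```

In this sketch the extra row (3, Y, 1.5) lowers the minimum for id1 = 3 to 1.5, so (3, Z, 2) no longer passes the filter and 3 ends up unmapped. Whether that is the desired behavior is exactly the open question discussed in the comments.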

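This looks great! What do you think of my UDF solution below? It avoids the join, but writing a Java UDF is more complicated than writing Pig.

The UDF solution is probably faster, since it appears to need only two reduce phases instead of three. However, I like to use pure Pig whenever possible, purely as a matter of personal preference. Also, the answer to my comment on your question will have a bearing on whether the UDF solution is acceptable. What if your input also included the mapping (3, Y, 1.5)? Should you still output (3, Z)?

That is an interesting question. In practice, most duplicate mappings are one-to-many or many-to-one; many-to-many is rare. So I am happy with either approach. The goal is to find the most accurate one-to-one mappings.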
