Apache Pig: how do I write this Pig query?
I have a many-to-many mapping table between two sets. Each row of the mapping table represents one possible mapping together with a weight score:
mapping(id1, id2, weight)
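Query: produce a one-to-one mapping between id1 and id2. Use the minimum weight to eliminate duplicate mappings. If there is a tie, output any one of the tied mappings.
Sample input: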
(1, X, 1)
(1, Y, 2)
(2, X, 3)
(2, Y, 1)
(3, Z, 2)
Output:
(1, X)
(2, Y)
(3, Z)
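Both 1 and 2 map to X and to Y. We choose the mappings (1, X) and (2, Y) because they have the lowest weights.
I solved this with a Java UDF. It is not perfect, in the sense that it does not maximize the number of one-to-one mappings, but it is good enough.
Pig:
Java UDF: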
package com.propeld.pig;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Returns the tuple with the smallest weight (field 2) from a bag of
// (id1, id2, weight) tuples. Algebraic, so Pig can use the combiner.
public class DEDUP extends EvalFunc<Tuple> implements Algebraic {

    public String getInitial() { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal() { return Final.class.getName(); }

    static public class Initial extends EvalFunc<Tuple> {
        private static TupleFactory tfact = TupleFactory.getInstance();

        public Tuple exec(Tuple input) throws IOException {
            // Initial is called in the map.
            // We just send the tuple down.
            try {
                // input is a bag with one tuple containing
                // the column we are trying to operate on
                DataBag bg = (DataBag) input.get(0);
                Iterator<Tuple> it = bg.iterator();
                if (it.hasNext()) {
                    return it.next();
                } else {
                    // make sure that we call the object constructor, not the list constructor
                    return tfact.newTuple((Object) null);
                }
            } catch (ExecException e) {
                throw e;
            } catch (Exception e) {
                int errCode = 2106;
                throw new ExecException("Error executing an algebraic function", errCode, PigException.BUG, e);
            }
        }
    }

    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return dedup(input);
        }
    }

    static public class Final extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return dedup(input);
        }
    }

    // Scans the bag in field 0 and keeps the tuple with the smallest weight.
    // On a tie, the first tuple encountered wins. Returns null for an empty bag.
    static protected Tuple dedup(Tuple input) throws ExecException {
        DataBag values = (DataBag) input.get(0);
        double min = Double.MAX_VALUE;
        Tuple result = null;
        for (Tuple t : values) {
            // Skip nulls and the one-field placeholder that Initial emits for an empty bag.
            if (t == null || t.size() < 3 || t.get(2) == null) {
                continue;
            }
            double weight = (Double) t.get(2);
            if (result == null || weight < min) {
                min = weight;
                result = t;
            }
        }
        return result;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        return dedup(input);
    }
}
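The Pig script that calls DEDUP is not shown above, so the following is only a sketch under an assumption: that the UDF is applied twice, first to each group of rows sharing an id1 and then to each group of surviving rows sharing an id2. The class TwoPassDedupSketch, its Row type, and the helper names are hypothetical and use plain Java collections rather than Pig types; ties keep the first row seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical stand-in for running DEDUP per id1 group, then per id2 group.
final class TwoPassDedupSketch {

    /** One row of the mapping relation: (id1, id2, weight). */
    static final class Row {
        final String id1;
        final String id2;
        final double weight;

        Row(String id1, String id2, double weight) {
            this.id1 = id1;
            this.id2 = id2;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return "(" + id1 + ", " + id2 + ")";
        }
    }

    /** Keeps the minimum-weight row for each key, as dedup does for one bag. */
    static List<Row> minPerKey(List<Row> rows, Function<Row, String> key) {
        // TreeMap only to make the output order deterministic.
        Map<String, Row> best = new TreeMap<>();
        for (Row r : rows) {
            // On a tie the row already stored is kept.
            best.merge(key.apply(r), r, (a, b) -> b.weight < a.weight ? b : a);
        }
        return new ArrayList<>(best.values());
    }

    /** Assumed usage: dedup by id1 first, then dedup the survivors by id2. */
    static List<Row> oneToOne(List<Row> rows) {
        List<Row> perId1 = minPerKey(rows, r -> r.id1);
        return minPerKey(perId1, r -> r.id2);
    }

    /** The sample input from the question. */
    static List<Row> sampleInput() {
        return Arrays.asList(
                new Row("1", "X", 1), new Row("1", "Y", 2),
                new Row("2", "X", 3), new Row("2", "Y", 1),
                new Row("3", "Z", 2));
    }

    public static void main(String[] args) {
        System.out.println(oneToOne(sampleInput())); // [(1, X), (2, Y), (3, Z)]

        // A hypothetical extra row makes 3 prefer Y, where it loses to 2.
        List<Row> extended = new ArrayList<>(sampleInput());
        extended.add(new Row("3", "Y", 1.5));
        System.out.println(oneToOne(extended)); // [(1, X), (2, Y)]
    }
}
```

The second call illustrates the caveat in the answer: under this assumed usage, 3 ends up unmapped even though (3, Z) would still be a valid pairing, so the number of one-to-one mappings is not maximized.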
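I will assume that you are only interested in mappings whose weight is the lowest among all mappings involving id1 and also the lowest among all mappings involving id2. For example, if you additionally had the mapping (2, Y, 4), it would not conflict with (1, X, 1). I will exclude mappings of this kind, because the weight is less than those of (1, Y, 2) and (2, X, 3), and those two do not qualify.
My solution is as follows: find the minimum mapping weight for each id1, then join it back onto the mapping relation for later reference. Use a nested FOREACH to examine each id2: use ORDER and LIMIT to select the minimum-weight record for that id2, then keep that record only if its weight is also the minimum weight for its id1.
Here is the complete script, tested on your input: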
mapping = LOAD 'input' AS (id1:chararray, id2:chararray, weight:double);

id1_weights =
    FOREACH (GROUP mapping BY id1)
    GENERATE group AS id1, MIN(mapping.weight) AS id1_min_weight;

mapping_with_id1_mins =
    FOREACH (JOIN mapping BY id1, id1_weights BY id1)
    GENERATE mapping::id1, id2, weight, id1_min_weight;

accepted_mappings =
    FOREACH (GROUP mapping_with_id1_mins BY id2)
    {
        ordered = ORDER mapping_with_id1_mins BY weight;
        selected = LIMIT ordered 1;
        acceptable = FILTER selected BY weight == id1_min_weight;
        GENERATE FLATTEN(acceptable);
    };

DUMP accepted_mappings;
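For readers who want to trace the rule by hand, the same selection can be sketched in plain Java. This is an illustration of the logic described above, not a replacement for the script: the class MutualMinimumSketch, its Mapping type, and the helper names are hypothetical, and on a tie within an id2 it keeps the first row seen, whereas Pig's ORDER plus LIMIT may pick any tied row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java rendering of the "lowest for id1 and lowest for id2" rule.
final class MutualMinimumSketch {

    /** One row of the mapping relation: (id1, id2, weight). */
    static final class Mapping {
        final String id1;
        final String id2;
        final double weight;

        Mapping(String id1, String id2, double weight) {
            this.id1 = id1;
            this.id2 = id2;
            this.weight = weight;
        }

        @Override
        public String toString() {
            return "(" + id1 + ", " + id2 + ")";
        }
    }

    static List<Mapping> select(List<Mapping> rows) {
        // Step 1: minimum weight per id1 (the id1_weights relation).
        Map<String, Double> id1Min = new HashMap<>();
        // Step 2: minimum-weight row per id2 (ORDER ... LIMIT 1 in the nested FOREACH).
        // TreeMap only to make the output order deterministic.
        Map<String, Mapping> id2Best = new TreeMap<>();
        for (Mapping m : rows) {
            id1Min.merge(m.id1, m.weight, Math::min);
            id2Best.merge(m.id2, m, (a, b) -> b.weight < a.weight ? b : a);
        }
        // Step 3: keep a row only if it is also the cheapest for its id1 (the FILTER).
        List<Mapping> accepted = new ArrayList<>();
        for (Mapping m : id2Best.values()) {
            if (m.weight == id1Min.get(m.id1)) {
                accepted.add(m);
            }
        }
        return accepted;
    }

    /** The sample input from the question. */
    static List<Mapping> sampleInput() {
        return Arrays.asList(
                new Mapping("1", "X", 1), new Mapping("1", "Y", 2),
                new Mapping("2", "X", 3), new Mapping("2", "Y", 1),
                new Mapping("3", "Z", 2));
    }

    public static void main(String[] args) {
        System.out.println(select(sampleInput())); // [(1, X), (2, Y), (3, Z)]
    }
}
```

Adding a row such as (3, Y, 1.5) to the input changes the minimum for id1 3 to 1.5, so (3, Z, 2) no longer passes the final check and 3 is left unmapped.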
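Comments:
This looks great! What do you think of my UDF solution? It avoids the join, but writing a Java UDF is more complicated than writing Pig.
The UDF solution may be faster, since it appears to need only two reduce phases instead of three. However, I like to use pure Pig wherever possible, purely as a matter of personal preference. Also, the answer to my comment on your question will affect whether the UDF solution is acceptable. What if your input also included the mapping (3, Y, 1.5)? Should you still output (3, Z)?
That is an interesting question. In practice, most duplicate mappings are one-to-many or many-to-one; many-to-many is rare. So I am happy with either approach. The goal is to find the most accurate one-to-one mapping.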