Why does this Pig UDF result in "Error: Java heap space" given that I am spilling the data bag to disk?

Tags: java, hadoop, out-of-memory, apache-pig

Here is my UDF:

public DataBag exec(Tuple input) throws IOException { 
    Aggregate aggregatedOutput = null;
    
    int spillCount = 0;

    DataBag outputBag = BagFactory.newDefaultBag(); 
    DataBag values = (DataBag)input.get(0);
    for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
        Tuple tuple = iterator.next();
        //spillCount++;
        ...
        if (some condition regarding current input tuple){
            //do something to aggregatedOutput with information from input tuple
        } else {
            //Because input tuple does not apply to current aggregateOutput
            //return current aggregateOutput and apply input tuple
            //to new aggregateOutput
            Tuple returnTuple = aggregatedOutput.getTuple();
            outputBag.add(returnTuple);
            spillCount++;
            aggregatedOutput = new Aggregate(tuple);
            
            
            if (spillCount == 1000) {
                outputBag.spill();
                spillCount = 0;
            }
        }
    }
    return outputBag; 
}
What can I do to solve this? It is processing roughly a million rows.

Here is the solution, using the Accumulator interface:

public class Foo extends EvalFunc<DataBag> implements Accumulator<DataBag> {
    private DataBag outputBag = null;
    private Aggregate currentAggregation = null;
    
    public void accumulate(Tuple input) throws IOException {
        DataBag values = (DataBag)input.get(0);
        // Initialize the output bag lazily: cleanup() resets it to null between
        // keys, and accumulate() may be called several times for the same key
        if (outputBag == null) {
            outputBag = BagFactory.getInstance().newDefaultBag();
        }
        
        for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
            Tuple tuple = iterator.next();
            ...
            if (some condition regarding current input tuple){
                //do something to currentAggregation with information from input tuple
            } else {
                //Because the input tuple does not apply to the current aggregate,
                //add the current aggregate to the output bag and start a new
                //aggregate from the input tuple
                outputBag.add(currentAggregation.getTuple());
                currentAggregation = new Aggregate(tuple);
            }
        }
    }
    
    // Called when all tuples from current key have been passed to accumulate
    public DataBag getValue() {
        //Add final current aggregation
        outputBag.add(currentAggregation.getTuple());
        return outputBag;
    }
    // Called after getValue(); resets the state so the next key starts fresh
    public void cleanup() {
        outputBag = null;
        currentAggregation = null;
    }
    
    public DataBag exec(Tuple input) throws IOException {
        // Same logic as accumulate() followed by getValue(), but this doesn't
        // appear to ever be called when Pig uses the accumulator interface
        return null;
    }
    
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), bagSchema, DataType.BAG));
        } catch (FrontendException e) {
            e.printStackTrace();
            return null;
        }
    }
    
    class Aggregate {
        ...
        public Tuple getTuple() {
            Tuple output = TupleFactory.getInstance().newTuple(OUTPUT_TUPLE_SIZE);
            try {
                output.set(0, val);
                ...
            } catch (ExecException e) {
                e.printStackTrace();
                return null;
            }
            return output;
        }
        ...
    }
}

You should increment spillCount each time you append to outputBag, not each time you get a tuple from the iterator. As written, you would only spill when spillCount is a multiple of 1000 and the if condition is not met, which may not happen very often (depending on your logic). This may explain why different spill thresholds didn't make much of a difference; a sketch of the corrected bookkeeping follows.
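For concreteness, here is a minimal sketch of that bookkeeping, assuming the same Pig imports (org.apache.pig.data.*) as the posted UDF. The Aggregate helper and its matches()/update() methods are hypothetical stand-ins for the elided aggregation logic, not the poster's actual code:

private DataBag aggregateAll(DataBag values, Aggregate current) {
    // assumes `current` was already initialized from the first tuple
    DataBag outputBag = BagFactory.getInstance().newDefaultBag();
    int spillCount = 0;
    for (Tuple tuple : values) {                   // DataBag is Iterable<Tuple>
        if (current.matches(tuple)) {
            current.update(tuple);                 // bag unchanged; nothing to count
        } else {
            outputBag.add(current.getTuple());     // the bag grows here...
            current = new Aggregate(tuple);
            spillCount++;                          // ...so this is where to count
            if (spillCount == 1000) {              // ~1000 tuples buffered in memory
                outputBag.spill();                 // push them to disk
                spillCount = 0;
            }
        }
    }
    outputBag.add(current.getTuple());             // flush the final aggregate
    return outputBag;
}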


If that doesn't solve your problem, I would try extending AccumulatorEvalFunc<DataBag>. In your case you don't actually need access to the whole bag; your implementation fits an accumulator-style implementation because you only need access to the current tuple. This may reduce memory usage. Essentially, you would have an instance variable of type DataBag that accumulates the final output, plus an instance variable holding the current aggregate. A call to accumulate() would either (1) update the current aggregate, or (2) add the current aggregate to the output bag and begin a new aggregate. This essentially follows the body of your for loop. A skeleton of that shape is sketched below.
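A minimal skeleton under the same assumptions as the sketch above (the class name FooAccumulator and the Aggregate helper are hypothetical). AccumulatorEvalFunc provides a default exec() that drives accumulate()/getValue()/cleanup(), so only these three methods need bodies:

import java.io.IOException;
import org.apache.pig.AccumulatorEvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class FooAccumulator extends AccumulatorEvalFunc<DataBag> {
    private DataBag outputBag;
    private Aggregate current;

    @Override
    public void accumulate(Tuple input) throws IOException {
        // Pig may call this several times per key, each time with a batch
        // of tuples, so state is initialized only when absent
        if (outputBag == null) {
            outputBag = BagFactory.getInstance().newDefaultBag();
        }
        DataBag values = (DataBag) input.get(0);
        for (Tuple tuple : values) {
            if (current != null && current.matches(tuple)) {
                current.update(tuple);                 // case 1: extend the current aggregate
            } else {
                if (current != null) {
                    outputBag.add(current.getTuple()); // case 2: close it out...
                }
                current = new Aggregate(tuple);        // ...and start a new one
            }
        }
    }

    @Override
    public DataBag getValue() {
        outputBag.add(current.getTuple()); // flush the final aggregate for this key
        return outputBag;
    }

    @Override
    public void cleanup() {
        outputBag = null; // reset state for the next key
        current = null;
    }
}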

Shoot, my apologies; I do reset spillCount to 0 after each spill, but I forgot to include that in the pseudocode. Please check the updated post? I'll look into the other part of your answer. Thanks.

Resetting it to zero won't make a difference in this case. Instead of incrementing spillCount when you call next(), you should increment it when you append to outputBag. The purpose of spillCount is to keep track of how big the bag is; instead, you are using it to track how many tuples you have processed, which is not the same thing.

My apologies for the late response. I am now incrementing spillCount when appending to outputBag, so that it tracks the size of the bag, and I am still getting out-of-memory errors. Next I will implement AccumulatorEvalFunc and let you know how it goes. Quick question: the Pig book states, "Pig's bags handle spilling data to disk automatically when they pass a certain size threshold or when only a certain amount of heap space remains. Spilling to disk is expensive and should be avoided whenever possible. But if you must store large amounts of data in a bag, Pig will manage it." If Pig handles it, why do I need to spill?

Yes, I have seen that advice before too, but in the past I have found that even with automatic spilling I would sometimes still hit OOM errors. How much memory do you have available? Do you know what you have set for mapred.child.java.opts? It may be that the default max heap size is too low for your task; you can check the jobconf to see the value of this setting, and if it is low you could try increasing it to something like -Xmx1024M.
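One way to check what the task JVMs are actually getting is to log it from inside the UDF; a sketch, assuming the UDF runs where UDFContext has been populated with the job configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.pig.impl.util.UDFContext;

// Log the configured child JVM options and the heap the JVM actually
// received; the output appears in the task logs
Configuration conf = UDFContext.getUDFContext().getJobConf();
System.err.println("mapred.child.java.opts = " + conf.get("mapred.child.java.opts"));
System.err.println("max heap = " + Runtime.getRuntime().maxMemory() + " bytes");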