Java: Apache Pig not using the proper RecordWriter or OutputCommitter


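I am using a custom StoreFunc, OutputFormat, and OutputCommitter with Pig. The problem I'm running into is that Pig does not call some of the methods I defined in my OutputFormat, namely the ones that return the proper RecordWriter and OutputCommitter. As a result, the data is written somewhere else (honestly, I'm not sure where) instead of the intended destination. Pig does not throw any errors during the job.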

A simple example Pig script:

data = LOAD '<url>' USING com.company.CompanyLoader();
STORE data INTO '<other url>' USING com.company.CompanyStorage();

MyStoreFunc.java

public class MyStoreFunc extends StoreFunc {

    // Field name must match the identifier returned by getOutputFormat().
    private final CustomOutputFormat outputFormat = new CustomOutputFormat();
    private RecordWriter out;

    public OutputFormat getOutputFormat() throws IOException {
        LOG.info("getOutputFormat called.");
        return outputFormat;
    }

    public void prepareToWrite(final RecordWriter writer) throws IOException {
        out = writer;
        LOG.info("Using RecordWriter: " + writer.getClass());

        // other preparation
    }

    public void setStoreLocation(final String location, final Job job) {
        try {
            LOG.info("Output format class is set to: " + job.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("Output format class is undefined.");
        }
        LOG.info("Output committer is " + job.getConfiguration().get("mapred.output.committer.class", "undefined"));
        // other preparation
    }

    // other stuff...
}
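
A note on the first log statement in setStoreLocation: in stock Hadoop, JobContext.getOutputFormatClass() falls back to TextOutputFormat when no output format class is stored in the job configuration. Seeing TextOutputFormat in that message may therefore only mean that the key was unset at that moment. The sketch below is a plain-Java toy model of that lookup-with-default behaviour. It uses no Hadoop classes, and the class name, method name, and key string are illustrative assumptions, not the real Hadoop identifiers.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: mimics a configuration lookup that has a default value.
class OutputFormatLookupSketch {

    // Hypothetical key name, used only for this illustration.
    static final String KEY = "example.outputformat.class";

    // Value reported when the key is absent from the configuration.
    static final String DEFAULT = "TextOutputFormat";

    // Returns the configured class name, or the default if none is set.
    static String outputFormatClassName(Map<String, String> conf) {
        return conf.getOrDefault(KEY, DEFAULT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Nothing configured yet, so the default is reported.
        System.out.println(outputFormatClassName(conf));
        // Once the key is set, the configured value is reported instead.
        conf.put(KEY, "CustomOutputFormat");
        System.out.println(outputFormatClassName(conf));
    }
}
```

If this is what is happening, the value logged in setStoreLocation says little about which OutputFormat the store function itself hands to Pig through getOutputFormat.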

CustomOutputFormat.java

public class CustomOutputFormat<K, V> extends OutputFormat<K, V> {

    public CustomOutputFormat() {
        LOG.info("CustomOutputFormat created.");
    }

    public void checkOutputSpecs(final JobContext context) throws IOException {
        LOG.info("checkOutputSpecs called.");
        try {
            LOG.info("output format = " + context.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("output format not found.");
        }
        // Check some stuff in configuration
    }

    public OutputCommitter getOutputCommitter(final TaskAttemptContext ctx) {
        LOG.info("getOutputCommitter called.");
        return new CustomOutputCommitter();
    }

    public RecordWriter<K, V> getRecordWriter(final TaskAttemptContext ctx) {
        LOG.info("getRecordWriter called.");
        return new CustomRecordWriter<K, V>();
    }

}

Things I noticed:

  • getOutputCommitter and getRecordWriter are never called on CustomOutputFormat. Why?
  • prepareToWrite is never called on MyStoreFunc. What?
  • checkOutputSpecs is called on CustomOutputFormat, so clearly Pig "knows" about this class and is getting it from MyStoreFunc.

Thanks in advance.

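Can you share the actual code?

I can't share the original source code, but I've mocked up code that closely mirrors its structure and included sample log output.
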
2015-06-30 00:08:32,100 [main] INFO    CustomOutputFormat created.
2015-06-30 00:08:32,104 [main] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,110 [main] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,120 [main] INFO    getOutputFormat called.
2015-06-30 00:08:32,124 [main] INFO    checkOutputSpecs called.
2015-06-30 00:08:32,135 [main] INFO    output format=class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,140 [main] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,152 [main] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,154 [JobControl] INFO    CustomOutputFormat created.
2015-06-30 00:08:32,156 [JobControl] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,159 [JobControl] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,166 [JobControl] INFO    getOutputFormat called.
2015-06-30 00:08:32,169 [JobControl] INFO    checkOutputSpecs called.
2015-06-30 00:08:32,175 [JobControl] INFO    output format=class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
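
Every line in this log comes from the [main] or [JobControl] thread, that is, from the process that submits the job. In stock Hadoop MapReduce, checkOutputSpecs runs at job submission, while getRecordWriter is invoked inside the map or reduce tasks, which write their own logs. Assuming Pig follows the same split and calls prepareToWrite only after it has obtained a RecordWriter inside a task, the missing messages may simply be in the task logs rather than absent. The following is a toy model of that split in plain Java. It uses no Hadoop or Pig classes, every name in it is a hypothetical stand-in, and the call order within each phase is not meant to be authoritative.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: no Hadoop or Pig classes are used. All names are
// hypothetical stand-ins for the methods discussed in the question.
class LifecycleSketch {

    // Stand-in for the custom output format; records each call in a log.
    static class ToyOutputFormat {
        private final List<String> log;
        ToyOutputFormat(List<String> log) { this.log = log; }
        void checkOutputSpecs() { log.add("checkOutputSpecs"); }
        String getOutputCommitter() { log.add("getOutputCommitter"); return "committer"; }
        String getRecordWriter() { log.add("getRecordWriter"); return "writer"; }
    }

    // Stand-in for the store function; it writes to the same log.
    static class ToyStoreFunc {
        private final List<String> log;
        ToyStoreFunc(List<String> log) { this.log = log; }
        ToyOutputFormat getOutputFormat() {
            log.add("getOutputFormat");
            return new ToyOutputFormat(log);
        }
        void prepareToWrite(String writer) { log.add("prepareToWrite"); }
    }

    // Submission side: only the output specification is validated here.
    static List<String> clientPhase() {
        List<String> clientLog = new ArrayList<>();
        new ToyStoreFunc(clientLog).getOutputFormat().checkOutputSpecs();
        return clientLog;
    }

    // Task side: a fresh instance obtains the committer and the writer,
    // and these calls land in a separate log.
    static List<String> taskPhase() {
        List<String> taskLog = new ArrayList<>();
        ToyStoreFunc storeFunc = new ToyStoreFunc(taskLog);
        ToyOutputFormat format = storeFunc.getOutputFormat();
        format.getOutputCommitter();
        storeFunc.prepareToWrite(format.getRecordWriter());
        return taskLog;
    }

    public static void main(String[] args) {
        System.out.println("client log: " + clientPhase());
        System.out.println("task log:   " + taskPhase());
    }
}
```

In this model the client log contains only getOutputFormat and checkOutputSpecs, which matches the shape of the output above. Whether the real job behaves this way would need to be confirmed by inspecting the task attempt logs on the cluster.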