Apache Pig is not using the proper RecordWriter or OutputCommitter


java hadoop apache-pig


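I'm using a custom `StoreFunc`, `OutputFormat`, and `OutputCommitter` with Pig. The problem is that Pig never calls some of the methods I defined in my `OutputFormat`, specifically the ones that return the appropriate `RecordWriter` and `OutputCommitter`. As a result, the data is written somewhere else (honestly, I'm not sure where) instead of the intended destination. Pig does not throw any errors during the job.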

A simple example Pig script:

data = LOAD '<url>' USING com.company.CompanyLoader();
STORE data INTO '<other url>' USING com.company.CompanyStorage();

MyStoreFunc.java and CustomOutputFormat.java

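import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.pig.StoreFunc;

public class MyStoreFunc extends StoreFunc {

    // The logger declaration is not shown in the question; commons-logging is assumed here.
    private static final Log LOG = LogFactory.getLog(MyStoreFunc.class);

    // OutputFormat instance that Pig obtains through getOutputFormat().
    private final CustomOutputFormat outputFormat = new CustomOutputFormat();
    private RecordWriter out;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        LOG.info("getOutputFormat called.");
        return outputFormat;
    }

    @Override
    public void prepareToWrite(final RecordWriter writer) throws IOException {
        out = writer;
        LOG.info("Using RecordWriter: " + writer.getClass());

        // other preparation
    }

    @Override
    public void setStoreLocation(final String location, final Job job) {
        // Log what the job configuration currently reports for the output format and committer.
        try {
            LOG.info("Output format class is set to: " + job.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("Output format class is undefined.");
        }
        LOG.info("Output committer is " + job.getConfiguration().get("mapred.output.committer.class", "undefined"));
        // other preparation
    }

    // other stuff...
}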

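import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomOutputFormat<K, V> extends OutputFormat<K, V> {

    // The logger declaration is not shown in the question; commons-logging is assumed here.
    private static final Log LOG = LogFactory.getLog(CustomOutputFormat.class);

    public CustomOutputFormat() {
        LOG.info("CustomOutputFormat created.");
    }

    @Override
    public void checkOutputSpecs(final JobContext context) throws IOException {
        LOG.info("checkOutputSpecs called.");
        try {
            LOG.info("output format = " + context.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("output format not found.");
        }
        // Check some stuff in configuration
    }

    // Expected to supply the custom committer; this log line never appears.
    @Override
    public OutputCommitter getOutputCommitter(final TaskAttemptContext ctx) {
        LOG.info("getOutputCommitter called.");
        return new CustomOutputCommitter();
    }

    // Expected to supply the custom writer; this log line never appears either.
    @Override
    public RecordWriter<K, V> getRecordWriter(final TaskAttemptContext ctx) {
        LOG.info("getRecordWriter called.");
        return new CustomRecordWriter<K, V>();
    }

}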

Things I've noticed:

`getOutputCommitter` and `getRecordWriter` are never called on `CustomOutputFormat`. Why?

`prepareToWrite` is never called on `MyStoreFunc`. Why not?

`checkOutputSpecs` is called on `CustomOutputFormat`, so Pig clearly "knows" about this class and gets it from `MyStoreFunc`.

Thanks in advance.

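Can you share the actual code?

I can't share the original source code, but I've mocked up code whose structure closely resembles the original and provided sample log output.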

2015-06-30 00:08:32,100 [main] INFO    CustomOutputFormat created.
2015-06-30 00:08:32,104 [main] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,110 [main] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,120 [main] INFO    getOutputFormat called.
2015-06-30 00:08:32,124 [main] INFO    checkOutputSpecs called.
2015-06-30 00:08:32,135 [main] INFO    output format=class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,140 [main] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,152 [main] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,154 [JobControl] INFO    CustomOutputFormat created.
2015-06-30 00:08:32,156 [JobControl] INFO    Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,159 [JobControl] INFO    Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,166 [JobControl] INFO    getOutputFormat called.
2015-06-30 00:08:32,169 [JobControl] INFO    checkOutputSpecs called.
2015-06-30 00:08:32,175 [JobControl] INFO    output format=class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
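
The observations above can be checked mechanically against a log excerpt. The following is a minimal sketch, not part of the original code: the class name `CallbackLogCheck` and its methods are hypothetical, and the marker strings are simply the messages logged by the mocked `MyStoreFunc` and `CustomOutputFormat`. It only inspects the lines it is given, and the sample uses a subset of the `[main]` and `[JobControl]` lines shown above, so it says nothing about logs written anywhere else.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports which callback messages never show up in a log excerpt.
public class CallbackLogCheck {

    // Messages the mocked classes log when each callback runs (copied from the code above).
    public static final List<String> EXPECTED = Arrays.asList(
            "CustomOutputFormat created.",
            "getOutputFormat called.",
            "checkOutputSpecs called.",
            "getOutputCommitter called.",
            "getRecordWriter called.",
            "Using RecordWriter:");

    /** Counts how many log lines contain each expected message. */
    public static Map<String, Integer> countCallbacks(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String marker : EXPECTED) {
            counts.put(marker, 0);
        }
        for (String line : logLines) {
            for (String marker : EXPECTED) {
                if (line.contains(marker)) {
                    counts.put(marker, counts.get(marker) + 1);
                }
            }
        }
        return counts;
    }

    /** Returns the expected messages that never appear, in the order of EXPECTED. */
    public static List<String> missingCallbacks(List<String> logLines) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : countCallbacks(logLines).entrySet()) {
            if (entry.getValue() == 0) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A subset of the log lines from the excerpt above.
        List<String> log = Arrays.asList(
                "2015-06-30 00:08:32,100 [main] INFO    CustomOutputFormat created.",
                "2015-06-30 00:08:32,120 [main] INFO    getOutputFormat called.",
                "2015-06-30 00:08:32,124 [main] INFO    checkOutputSpecs called.",
                "2015-06-30 00:08:32,154 [JobControl] INFO    CustomOutputFormat created.",
                "2015-06-30 00:08:32,166 [JobControl] INFO    getOutputFormat called.",
                "2015-06-30 00:08:32,169 [JobControl] INFO    checkOutputSpecs called.");
        System.out.println("Never logged: " + missingCallbacks(log));
    }
}
```

For the sample lines, the three markers that never appear are `getOutputCommitter called.`, `getRecordWriter called.`, and `Using RecordWriter:`, which matches the behaviour described in the question.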