Java: Apache Pig is not using the proper RecordWriter or OutputCommitter
Tags: java, hadoop, apache-pig, emr, elastic-map-reduce
I'm using a custom StoreFunc, OutputFormat, and OutputCommitter with Pig. The problem is that Pig never calls some of the methods I defined in my OutputFormat, namely the ones that return the appropriate RecordWriter and OutputCommitter. As a result, the data is written somewhere else (honestly, I'm not sure where) instead of the intended destination. Pig doesn't throw any errors while the job runs.
A simple example Pig script:
data = LOAD '<url>' USING com.company.CompanyLoader();
STORE data INTO '<other url>' USING com.company.CompanyStorage();
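MyStoreFunc.java
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.pig.StoreFunc;

public class MyStoreFunc extends StoreFunc {
    // Logger declaration was omitted from the original snippet; commons-logging is assumed.
    private static final Log LOG = LogFactory.getLog(MyStoreFunc.class);
    // Field renamed from "OutputFormat" so that getOutputFormat() returns an existing variable.
    private final CustomOutputFormat outputFormat = new CustomOutputFormat();
    private RecordWriter out;

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        LOG.info("getOutputFormat called.");
        return outputFormat;
    }

    @Override
    public void prepareToWrite(final RecordWriter writer) throws IOException {
        out = writer;
        LOG.info("Using RecordWriter: " + writer.getClass());
        // other preparation
    }

    @Override
    public void setStoreLocation(final String location, final Job job) {
        try {
            LOG.info("Output format class is set to: " + job.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("Output format class is undefined.");
        }
        // Logs the old-API committer configured on the job, if any.
        LOG.info("Output committer is " + job.getConfiguration().get("mapred.output.committer.class", "undefined"));
        // other preparation
    }
    // other stuff...
}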
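CustomOutputFormat.java
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomOutputFormat<K, V> extends OutputFormat<K, V> {
    // Logger declaration was omitted from the original snippet; commons-logging is assumed.
    private static final Log LOG = LogFactory.getLog(CustomOutputFormat.class);

    public CustomOutputFormat() {
        LOG.info("CustomOutputFormat created.");
    }

    @Override
    public void checkOutputSpecs(final JobContext context) throws IOException {
        LOG.info("checkOutputSpecs called.");
        try {
            LOG.info("output format = " + context.getOutputFormatClass());
        } catch (ClassNotFoundException e) {
            LOG.info("output format not found.");
        }
        // Check some stuff in configuration
    }

    @Override
    public OutputCommitter getOutputCommitter(final TaskAttemptContext ctx) {
        LOG.info("getOutputCommitter called.");
        return new CustomOutputCommitter();
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(final TaskAttemptContext ctx) {
        LOG.info("getRecordWriter called.");
        return new CustomRecordWriter<K, V>();
    }
}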
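Things I've noticed:
- getOutputCommitter and getRecordWriter are never called on CustomOutputFormat. Why?
- prepareToWrite is never called on MyStoreFunc. What?
- checkOutputSpecs is called on CustomOutputFormat, so Pig clearly "knows" about this class and gets it from MyStoreFunc.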
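Thanks in advance.
Comment: Can you share the actual code?
Reply: I can't share the original source code, but I've mocked up code whose structure closely mirrors the original, and I've included sample log output.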
2015-06-30 00:08:32,100 [main] INFO CustomOutputFormat created.
2015-06-30 00:08:32,104 [main] INFO Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,110 [main] INFO Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,120 [main] INFO getOutputFormat called.
2015-06-30 00:08:32,124 [main] INFO checkOutputSpecs called.
2015-06-30 00:08:32,135 [main] INFO output format=class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,140 [main] INFO Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,152 [main] INFO Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,154 [JobControl] INFO CustomOutputFormat created.
2015-06-30 00:08:32,156 [JobControl] INFO Output format class is set to: class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
2015-06-30 00:08:32,159 [JobControl] INFO Output committer is org.apache.hadoop.mapred.DirectFileOutputCommitter
2015-06-30 00:08:32,166 [JobControl] INFO getOutputFormat called.
2015-06-30 00:08:32,169 [JobControl] INFO checkOutputSpecs called.
2015-06-30 00:08:32,175 [JobControl] INFO output format=class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
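One way to check the observations above against the log is to tally which callback messages were logged, and by which thread. The following is only a minimal sketch: the class name CallbackTally and its helper methods are hypothetical, the line shape (date, time, [thread], level, message) is assumed from the sample output above, and the expected messages are taken from the LOG.info calls in the mocked-up code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CallbackTally {
    // Assumed line shape: "<date> <time> [<thread>] <LEVEL> <message>"
    private static final Pattern LINE =
        Pattern.compile("^\\S+ \\S+ \\[([^\\]]+)\\] (\\w+) (.*)$");

    // Messages logged by the callbacks in the question's mocked-up code.
    public static final List<String> EXPECTED = List.of(
        "getOutputFormat called.",
        "checkOutputSpecs called.",
        "getOutputCommitter called.",
        "getRecordWriter called.");

    /** Groups log messages by the thread that logged them, in order of first appearance. */
    public static Map<String, Set<String>> messagesByThread(List<String> lines) {
        Map<String, Set<String>> byThread = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                byThread.computeIfAbsent(m.group(1), k -> new LinkedHashSet<>())
                        .add(m.group(3));
            }
        }
        return byThread;
    }

    /** Returns the expected callback messages that no thread logged. */
    public static List<String> missing(Map<String, Set<String>> byThread) {
        List<String> absent = new ArrayList<>();
        for (String expected : EXPECTED) {
            boolean seen = byThread.values().stream().anyMatch(s -> s.contains(expected));
            if (!seen) {
                absent.add(expected);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        // A few lines copied from the sample log in the question.
        List<String> sample = List.of(
            "2015-06-30 00:08:32,120 [main] INFO getOutputFormat called.",
            "2015-06-30 00:08:32,124 [main] INFO checkOutputSpecs called.",
            "2015-06-30 00:08:32,166 [JobControl] INFO getOutputFormat called.",
            "2015-06-30 00:08:32,169 [JobControl] INFO checkOutputSpecs called.");
        Map<String, Set<String>> byThread = messagesByThread(sample);
        System.out.println("threads seen: " + byThread.keySet());
        System.out.println("never logged: " + missing(byThread));
    }
}
```

The excerpt only contains lines from the [main] and [JobControl] threads, so a tally like this cannot show whether the remaining callbacks were logged somewhere else, for example in the logs of the individual tasks.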