Hadoop WholeFileInputFormat vs. CombineFileInputFormat
Tags: hadoop, mapreduce, hdfs, bigdata, yarn

How do I use WholeFileInputFormat together with CombineFileInputFormat? Say I have 10,000 small binary files and am currently using WholeFileInputFormat. It works correctly, but it is inefficient because it creates 10,000 mappers, and each map task finishes in just a few seconds. That is why I would like to pass more files to a single mapper to reduce the overhead. One option is CombineFileInputFormat. I was able to run it, and it created the expected number of mappers, but the job then ran indefinitely. I think my implementation of getProgress is wrong, and I have noticed that each map task only reads the first file of its split instead of moving on to the next file in the list. Here is my custom input format:
public class MyCombineFileInputFormat extends CombineFileInputFormat<NullWritable, BytesWritable> {

    public MyCombineFileInputFormat() {
        super();
        setMaxSplitSize(1048576); // combine small files into splits of at most 1 MB
    }

    protected boolean isSplitable(JobContext context, Path file) {
        return false; // each binary file must be read whole, never split
    }

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        return new CombineFileRecordReader<NullWritable, BytesWritable>(job, (CombineFileSplit) split, reporter,
                (Class) MyCombineFileRecordReader.class);
    }
}
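For context, the CombineFileRecordReader returned above walks the files of the combined split by constructing one delegate reader (here, MyCombineFileRecordReader) per path index, and it only advances to the next index once the current delegate's next() returns false. The following stdlib-only sketch models that dispatch pattern without any Hadoop dependencies (class and method names are illustrative, not Hadoop API):

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Simplified model of CombineFileRecordReader's dispatch loop: one delegate
// reader is created per file in the combined split, and the loop moves to the
// next file only after the current delegate reports exhaustion. This is why
// the delegate's constructor runs once per index, and why a delegate whose
// next() never returns false makes the task run forever.
public class CombineDispatchSketch {

    public static int countRecords(List<String> files,
                                   Function<String, Iterator<String>> readerFactory) {
        int records = 0;
        for (String file : files) {                    // new delegate per index
            Iterator<String> reader = readerFactory.apply(file);
            while (reader.hasNext()) {                 // delegate's next(...)
                reader.next();
                records++;
            }
        }
        return records;
    }
}
```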
And here is my custom combine file record reader:
public class MyCombineFileRecordReader implements RecordReader<NullWritable, BytesWritable> {

    private NullWritable key = NullWritable.get();
    private BytesWritable value = new BytesWritable();
    private Path path;
    private FileSystem fs;
    private FileSplit filesplit;
    private int totalNumOfPaths;
    private static int processedPaths;

    public static Logger LOGGER = Logger.getLogger(MyCombineFileRecordReader.class);

    public MyCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index)
            throws IOException {
        this.totalNumOfPaths = split.getNumPaths();
        LOGGER.info("**** Total number of paths: " + totalNumOfPaths);
        this.filesplit = new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index),
                split.getLocations());
        this.path = split.getPath(index);
        this.fs = this.path.getFileSystem(conf);
        processedPaths = 0;
    }

    // I think this is not right
    @Override
    public float getProgress() throws IOException {
        if (processedPaths == totalNumOfPaths) {
            LOGGER.info("**** Completed # of files");
            return 1.0f;
        } else {
            return 0.0f;
        }
    }

    // I have found that this method is being called multiple times for the same file
    @Override
    public boolean next(NullWritable key, BytesWritable val) throws IOException {
        if (filesplit != null) {
            byte[] contents = new byte[(int) filesplit.getLength()];
            LOGGER.info("**** Reading path: " + path);
            FSDataInputStream in = null;
            try {
                in = fs.open(path);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
                LOGGER.info("**** Processed path count: " + processedPaths);
                processedPaths++;
            } finally {
                IOUtils.closeStream(in);
            }
            return true;
        }
        return false;
    }

    @Override
    public NullWritable createKey() {
        return key;
    }

    @Override
    public BytesWritable createValue() {
        return value;
    }
}

Answer: Here is an implementation of what you are doing. See if it helps; I think it is quite similar to what you are looking for. Take a look at this RecordReader implementation.

Comment from the asker: That was it! It works now. It turns out the reader's constructor is being called multiple times, once per index, based on the total number of paths in the split. After checking the link and the source code, that cleared up my confusion. Thanks!
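Given that resolution, the fix is for each reader to treat its single file as its entire input: next() delivers the file on the first call and returns false on every call after that, and getProgress() reports only that one file's state (the static processedPaths counter and the comparison against totalNumOfPaths are unnecessary). A minimal sketch of the corrected logic, using local java.nio file I/O in place of the HDFS FileSystem/FSDataInputStream so it is self-contained (the class and its names are illustrative, not the original code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Corrected per-file reader logic: CombineFileRecordReader builds one reader
// per (split, index) pair, so each reader owns exactly one file. next() must
// emit the whole file once, then signal exhaustion so the dispatcher can
// advance to the next index; otherwise the task loops forever on one file.
public class OneFileReader {

    private final Path path;
    private boolean processed = false; // per-instance flag, never static

    public OneFileReader(Path path) {
        this.path = path;
    }

    // Returns the whole file on the first call, null afterwards
    // (null plays the role of next(...) returning false).
    public byte[] next() throws IOException {
        if (processed) {
            return null;
        }
        processed = true;
        return Files.readAllBytes(path); // whole-file read, like IOUtils.readFully
    }

    // Progress for THIS file only: 0.0 before it is delivered, 1.0 after.
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }
}
```

In the real RecordReader the same effect can be had by setting this.filesplit = null (or flipping a boolean field) after the read inside next(NullWritable, BytesWritable), so the enclosing CombineFileRecordReader moves on to the next index in the split.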