Hadoop: using WholeFileInputFormat with CombineFileInputFormat


How can WholeFileInputFormat be used together with CombineFileInputFormat? Suppose I have 10,000 small binary files and am currently using WholeFileInputFormat. It works correctly, but it is inefficient because it creates 10,000 mappers, and each map task finishes in just a few seconds. That is why I want to pass more files to a single mapper to reduce the overhead. One option is CombineFileInputFormat. I was able to run it, and it created the expected number of mappers, but it runs indefinitely. I suspect my implementation of getProgress is wrong, and I have also noticed that each map task only reads the first file of its split instead of moving on to the next file in the list.

Here is my custom input format:

public class MyCombineFileInputFormat extends CombineFileInputFormat<NullWritable, BytesWritable> {

public MyCombineFileInputFormat() {
    super();
    setMaxSplitSize(1048576);
}


protected boolean isSplitable(JobContext context, Path file) {
    return false;
}


@SuppressWarnings({ "unchecked", "rawtypes" })
@Override
public RecordReader<NullWritable, BytesWritable> getRecordReader(InputSplit split, JobConf job,
        Reporter reporter) throws IOException {
    return new CombineFileRecordReader<NullWritable, BytesWritable>(job, (CombineFileSplit) split, reporter,
        (Class) MyCombineFileRecordReader.class);
}
}

And here is my custom combine file record reader:

public class MyCombineFileRecordReader implements RecordReader<NullWritable, BytesWritable> {

private NullWritable key = NullWritable.get();

private BytesWritable value = new BytesWritable();
private Path path;
private FileSystem fs;
private FileSplit filesplit;
private int totalNumOfPaths;
private static int processedPaths;

public static Logger LOGGER = Logger.getLogger(MyCombineFileRecordReader.class);


public MyCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index)
        throws IOException {
    this.totalNumOfPaths = split.getNumPaths();
    LOGGER.info("**** Total number of paths: " + totalNumOfPaths);
    this.filesplit =
            new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index),
                split.getLocations());
    this.path = split.getPath(index);
    this.fs = this.path.getFileSystem(conf);
    processedPaths = 0;
}

--- I think this implementation is not right:
@Override
public float getProgress() throws IOException {
    if (processedPaths == totalNumOfPaths) {
        LOGGER.info("**** Completed # of files");
        return 1.0f;
    } else {
        return 0.0f;
    }
}

--- I have found that this method is being called multiple times for the same file:
@Override
public boolean next(NullWritable key, BytesWritable val) throws IOException {
    if (filesplit != null) {
        byte[] contents = new byte[(int) filesplit.getLength()];
        LOGGER.info("**** Reading path: " + path);
        FSDataInputStream in = null;
        try {
            in = fs.open(path);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
            LOGGER.info("**** Processed path count: " + processedPaths);
            processedPaths++;
        } finally {
            IOUtils.closeStream(in);
        }
        return true;
    }
    return false;
}


@Override
public NullWritable createKey() {
    return key;
}
@Override
public BytesWritable createValue() {
    return value;
}
}

Here is an implementation of what you are doing; see if it helps. I think it is quite similar to what you are looking for. Take a look at this RecordReader implementation. @倒在地上的椰子 that was it! It works now. It turns out that the reader's constructor is called multiple times, once for each index among the total number of paths in the split. After checking the link and the source code, my confusion is gone. Thanks!
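To summarize the fix that follows from the discussion above: CombineFileRecordReader constructs one MyCombineFileRecordReader per file index in the split, so each reader instance is responsible for exactly one file and must report itself as finished after emitting it. The static processedPaths counter in the original code is therefore unnecessary; a per-instance boolean is enough. A minimal sketch of the corrected methods (using the same old mapred API as the question; the processed flag is an addition, not part of the original code):

```java
// One reader instance handles one file of the split, so a per-instance
// flag (not a static counter) marks that file as consumed.
private boolean processed = false;

@Override
public boolean next(NullWritable key, BytesWritable value) throws IOException {
    if (processed) {
        return false; // this reader's single file was already emitted
    }
    byte[] contents = new byte[(int) filesplit.getLength()];
    FSDataInputStream in = null;
    try {
        in = fs.open(path);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
    } finally {
        IOUtils.closeStream(in);
    }
    processed = true; // next call returns false, so the framework advances
    return true;
}

@Override
public float getProgress() throws IOException {
    // Progress of this one-file reader, not of the whole split.
    return processed ? 1.0f : 0.0f;
}
```

Returning false on the second call to next() is what lets CombineFileRecordReader advance to the next file's reader; the original code always returned true while a filesplit existed, which is why the job ran forever on the first file.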