Hadoop: using WholeFileInputFormat with CombineFileInputFormat


How can WholeFileInputFormat be used together with CombineFileInputFormat? Suppose I have 10,000 small binary files and am currently using WholeFileInputFormat. It works correctly, but it is inefficient because it creates 10,000 mappers, and each map task finishes in just a few seconds. That is why I want to pass more files to a single mapper to reduce the overhead. One option is CombineFileInputFormat. I was able to run it, and it created the expected number of mappers, but it runs indefinitely. I suspect my implementation of getProgress is wrong, and I have also noticed that each map task only reads the first file of its split instead of moving on to the next file in the list.

Here is my custom input format:

public class MyCombineFileInputFormat extends CombineFileInputFormat<NullWritable, BytesWritable> {

public MyCombineFileInputFormat() {
    super();
    setMaxSplitSize(1048576);
}


protected boolean isSplitable(JobContext context, Path file) {
    return false;
}


@SuppressWarnings({ "unchecked", "rawtypes" })
@Override
public RecordReader<NullWritable, BytesWritable> getRecordReader(InputSplit split, JobConf job,
        Reporter reporter) throws IOException {
    return new CombineFileRecordReader<NullWritable, BytesWritable>(job, (CombineFileSplit) split, reporter,
        (Class) MyCombineFileRecordReader.class);
}
}

And here is my custom combine file record reader:

public class MyCombineFileRecordReader implements RecordReader<NullWritable, BytesWritable> {

private NullWritable key = NullWritable.get();

private BytesWritable value = new BytesWritable();
private Path path;
private FileSystem fs;
private FileSplit filesplit;
private int totalNumOfPaths;
private static int processedPaths;

public static Logger LOGGER = Logger.getLogger(MyCombineFileRecordReader.class);


public MyCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index)
        throws IOException {
    this.totalNumOfPaths = split.getNumPaths();
    LOGGER.info("**** Total number of paths: " + totalNumOfPaths);
    this.filesplit =
            new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index),
                split.getLocations());
    this.path = split.getPath(index);
    this.fs = this.path.getFileSystem(conf);
    processedPaths = 0;
}

--- I think this implementation is not right:
@Override
public float getProgress() throws IOException {
    if (processedPaths == totalNumOfPaths) {
        LOGGER.info("**** Completed # of files");
        return 1.0f;
    } else {
        return 0.0f;
    }
}

--- I have found that this method is being called multiple times for the same file:
@Override
public boolean next(NullWritable key, BytesWritable val) throws IOException {
    if (filesplit != null) {
        byte[] contents = new byte[(int) filesplit.getLength()];
        LOGGER.info("**** Reading path: " + path);
        FSDataInputStream in = null;
        try {
            in = fs.open(path);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
            LOGGER.info("**** Processed path count: " + processedPaths);
            processedPaths++;
        } finally {
            IOUtils.closeStream(in);
        }
        return true;
    }
    return false;
}


@Override
public NullWritable createKey() {
    return key;
}
@Override
public BytesWritable createValue() {
    return value;
}
}

Here is an implementation of what you are doing; see if it helps. I think it is quite similar to what you are looking for. Take a look at this RecordReader implementation. @倒在地上的椰子 that was it! It works now. It turns out that the reader's constructor is called multiple times, once for each index among the total number of paths in the split. After checking the link and the source code, my confusion is gone. Thanks!
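To summarize the fix that follows from the discussion above: CombineFileRecordReader constructs one MyCombineFileRecordReader per file index in the split, so each reader instance is responsible for exactly one file and must report itself as finished after emitting it. The static processedPaths counter in the original code is therefore unnecessary; a per-instance boolean is enough. A minimal sketch of the corrected methods (using the same old mapred API as the question; the processed flag is an addition, not part of the original code):

```java
// One reader instance handles one file of the split, so a per-instance
// flag (not a static counter) marks that file as consumed.
private boolean processed = false;

@Override
public boolean next(NullWritable key, BytesWritable value) throws IOException {
    if (processed) {
        return false; // this reader's single file was already emitted
    }
    byte[] contents = new byte[(int) filesplit.getLength()];
    FSDataInputStream in = null;
    try {
        in = fs.open(path);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
    } finally {
        IOUtils.closeStream(in);
    }
    processed = true; // next call returns false, so the framework advances
    return true;
}

@Override
public float getProgress() throws IOException {
    // Progress of this one-file reader, not of the whole split.
    return processed ? 1.0f : 0.0f;
}
```

Returning false on the second call to next() is what lets CombineFileRecordReader advance to the next file's reader; the original code always returned true while a filesplit existed, which is why the job ran forever on the first file.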