Hadoop: how to remove combiner output and keep only reducer output in the final MapReduce output


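Hi, I am running an application that reads records from HBase and writes them to text files.

I use a combiner in my application, and also a custom partitioner. I use 41 reducers because I need to create 40 reducer output files that satisfy the conditions of my custom partitioner.

Everything works fine, but when I use the combiner, it creates a map output file for each region, that is, for each mapper.

For example, my application has 40 regions, so 40 mappers are launched and 40 map output files are created. But the reducers are not able to combine all the map outputs and generate the final reducer output files, which should be 40 reducer output files.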

import java.io.IOException;
import org.apache.log4j.Logger;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CommonCombiner extends Reducer<NullWritable, Text, NullWritable, Text> {

    private Logger logger = Logger.getLogger(CommonCombiner.class);
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";
    private static final String DATA_SEPERATOR = "\\|\\!\\|";

    public void setup(Context context) {
        logger.info("Inside Combiner.");
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    public void reduce(NullWritable Key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        for (Text value : values) {
            final String valueStr = value.toString();
            StringBuilder sb = new StringBuilder();
            if ("".equals(strName) && strName.length() == 0) {
                String[] strArrFileName = valueStr.split(DATA_SEPERATOR);
                String strFullFileName[] = strArrFileName[1].split("\\|\\^\\|");

                strName = strFullFileName[strFullFileName.length - 1];


                String strArrvalueStr[] = valueStr.split(DATA_SEPERATOR);
                if (!strArrvalueStr[0].contains(HbaseBulkLoadMapperConstants.FF_ACTION)) {
                    sb.append(strArrvalueStr[0] + "|!|");
                }
                multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
                context.getCounter(Counters.FILE_DATA_COUNTER).increment(1);


            }

        }
    }


    public void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}

The data in the files is correct, but the number of files has increased.

Any idea how I can get only the reducer output files?


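Let's get the basics clear.

A combiner is an optimization. It can run on the mapper side as well as on the reduce side (in the merge phase of the reducer's fetch-merge-reduce sequence).

Find out how the keys are distributed in your data: does a given mapper see the same key many times? If yes, the combiner is helping; otherwise it has no effect.

With 1K regions there is no guarantee that the data is divided equally among them. You have some hot regions.

Find the hot regions and split them.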
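One rough way to check the key distribution described above is to sample the keys a single mapper emits and see how many of them repeat. The following is only a sketch: the class name `CombinerBenefit` and the method `reductionRatio` are hypothetical and not part of Hadoop, and it assumes an ideal combiner that collapses every key to a single record.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CombinerBenefit {

    /**
     * Fraction of records that an ideal per-mapper combiner could remove,
     * assuming it collapses all records of a key into one:
     * 1 - (distinct keys / total records). Returns 0.0 for empty input.
     */
    public static double reductionRatio(List<String> keysSeenByMapper) {
        if (keysSeenByMapper.isEmpty()) {
            return 0.0;
        }
        Set<String> distinctKeys = new HashSet<>(keysSeenByMapper);
        return 1.0 - (double) distinctKeys.size() / keysSeenByMapper.size();
    }

    public static void main(String[] args) {
        // Four records, two distinct keys: half of the records could be combined away.
        System.out.println(reductionRatio(java.util.Arrays.asList("a", "a", "a", "b")));
        // All keys distinct: the combiner has nothing to merge.
        System.out.println(reductionRatio(java.util.Arrays.asList("a", "b", "c")));
    }
}
```

A ratio close to 0 for most mappers suggests the combiner is mostly overhead; a ratio close to 1 suggests it is removing a lot of shuffle traffic.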


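You are not outputting any data from the combiner for the reducer to consume. In your combiner you are using:

multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);

That is not how you write data that is passed between stages, i.e. from the mapper or the combiner to the reduce phase. You should use:

context.write()

MultipleOutputs is just a way to write extra files to disk in cases where you need more than one output file. I have never seen it used in a combiner.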
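The difference can be illustrated with a toy model. This is a plain-Java simulation, not Hadoop code: `ShuffleContext` and `SideFiles` are hypothetical stand-ins for the task context and for a side-file writer such as MultipleOutputs, and the only assumption it encodes is the one stated above, that the reduce phase receives only what was written through the context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerFlowSketch {

    /** Hypothetical stand-in for the task context: only records written here reach the reducer. */
    static final class ShuffleContext {
        final List<String> shuffled = new ArrayList<>();
        void write(String value) { shuffled.add(value); }
    }

    /** Hypothetical stand-in for a side-file writer: records land in files, not in the shuffle. */
    static final class SideFiles {
        final List<String> written = new ArrayList<>();
        void write(String value, String fileName) { written.add(fileName + ": " + value); }
    }

    /** Writes only to side files, as the combiner in the question does. */
    static void combineToSideFiles(List<String> values, SideFiles side) {
        for (String value : values) {
            side.write(value, "someFileName");
        }
    }

    /** Forwards every record through the context so the next stage can consume it. */
    static void combineToContext(List<String> values, ShuffleContext context) {
        for (String value : values) {
            context.write(value);
        }
    }

    /** Runs one simulated combiner pass and reports how many records the reducer would receive. */
    public static int reducerInputSize(boolean useContextWrite) {
        List<String> mapOutput = Arrays.asList("r1", "r2", "r3");
        ShuffleContext context = new ShuffleContext();
        SideFiles side = new SideFiles();
        if (useContextWrite) {
            combineToContext(mapOutput, context);
        } else {
            combineToSideFiles(mapOutput, side);
        }
        return context.shuffled.size();
    }

    public static void main(String[] args) {
        System.out.println(reducerInputSize(false)); // side files only: reducer sees 0 records
        System.out.println(reducerInputSize(true));  // context.write: reducer sees 3 records
    }
}
```

In a real job the same idea would mean keeping the combiner limited to context.write() and moving any file-naming logic that needs MultipleOutputs into the reducer.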
What do you mean by the combiner creating output files? It should not work that way. The combiner runs on the map side as a local reducer, and the output files are then produced by the reducers. The tmp files written by the mappers (in the local file system, not in HDFS) are deleted after the job finishes.

Just comment it out.

@BinaryNerd Yes, when I use the combiner, many small output files are created. If I comment out the combiner class, my job becomes slow.

@BinaryNerd I am getting combiner output, and that is the problem. If I use only the reducer, I get the final reducer output.

You mean when you include the combiner?