Java: Finding the average of numbers using MapReduce


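I have been trying to write some code that finds the average of numbers using MapReduce.

I am trying to use global counters to reach my goal, but I cannot set the counter value in the map method of my mapper, and I cannot retrieve the counter value in the reduce method of my reducer.

Do I have to use a global counter in map anyway (e.g. by using incrCounter(key, amount) of the provided Reporter)? Or would you suggest different logic for computing the average of some numbers?

The logic is quite simple: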
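If all the numbers share the same key, the mapper emits every value you want to average under that key. In the reducer you can then sum the values from the iterator while keeping a counter as you iterate, which tells you how many items are being averaged. After the iteration, you get the average by dividing the sum by the number of items.
Be careful: this logic will not work if the combiner class is set to the same class as the reducer, because the reducer would then average values that are already partial averages. Use three separate classes (mapper, combiner, reducer) to solve this.
For the complete code and an explanation, see the link below.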
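The combiner caveat above can be illustrated without Hadoop. The following is a minimal sketch in plain Java, assuming a hypothetical split of five values across two map tasks; the class and method names are made up for illustration. It shows that averaging the per-partition averages gives a different result from averaging all values at once.

```java
import java.util.Arrays;
import java.util.List;

public class AverageOfAverages {
    // Plain arithmetic mean of a non-empty list.
    static double mean(List<Double> xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.size();
    }

    public static void main(String[] args) {
        // Hypothetical split of the values 1, 2, 3, 4, 10 across two map tasks.
        List<Double> partA = Arrays.asList(1.0, 2.0, 3.0, 4.0);
        List<Double> partB = Arrays.asList(10.0);

        // What happens when the averaging reducer is reused as a combiner:
        // each partition is averaged, then the averages are averaged again.
        double wrong = mean(Arrays.asList(mean(partA), mean(partB)));

        // The mean over all values, which is what we actually want.
        double right = mean(Arrays.asList(1.0, 2.0, 3.0, 4.0, 10.0));

        System.out.println(wrong + " vs " + right); // prints "6.25 vs 4.0"
    }
}
```

The two results differ because the partitions have different sizes, so the combiner must pass on something other than a finished average.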
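The average is sum / size. If the sum has the form sum = k1 + k2 + k3 + ..., you can divide by the size either after or during the summation. So the average is also k1/size + k2/size + k3/size + ...
The Java 8 code is simple: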
    public double average(List<Valuable> list) {
      final int size = list.size();
      return list
            .stream()
            .mapToDouble(element->element.someValue())
            .reduce(0,(sum,x)->sum+x/size);
    }

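So each element of the list is first mapped to a double value, and the reduce function then adds the values up, dividing each term by the size.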
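As an alternative sketch of the same idea, the standard library's DoubleStream.average() can do the summing and dividing in one call. The Valuable interface below is a hypothetical stand-in for the type used in the snippet above, and returning 0.0 for an empty list is an assumption chosen to match what the reduce-based version returns in that case.

```java
import java.util.Arrays;
import java.util.List;

public class StreamAverage {
    // Hypothetical stand-in for the Valuable type used in the answer.
    interface Valuable {
        double someValue();
    }

    static double average(List<Valuable> list) {
        return list.stream()
                .mapToDouble(Valuable::someValue)
                .average()     // OptionalDouble, empty for an empty list
                .orElse(0.0);  // assumption: report 0.0 when there is nothing to average
    }

    public static void main(String[] args) {
        List<Valuable> list =
                Arrays.<Valuable>asList(() -> 1.0, () -> 2.0, () -> 6.0);
        System.out.println(average(list)); // prints 3.0
    }
}
```

This only works on a single machine with the whole list in memory; it does not address how to split the computation across mappers and reducers.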
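The arithmetic mean is an aggregate function that is not distributive but algebraic. An aggregate function is distributive if:
[…] it can be computed in a distributed manner as follows […]. Suppose […] the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner.
In other words, it has to be associative and commutative. An aggregate function is algebraic, however, if:
[…] it can be computed by an algebraic function with m arguments (where m is a bounded positive integer), each of which is obtained by applying a distributive aggregate function.
For the arithmetic mean this is simply avg = sum / count, so you obviously need to carry a count along as well. Using a global counter for that looks like misuse, though. The API describes org.apache.hadoop.mapreduce.Counter as follows:
A named counter that tracks the progress of a map/reduce job.
Counters should typically be used for statistics about the job, not as part of the calculation performed during data processing.
So within a partition, all you have to do is add up your numbers and track their count together with the sum, as a pair (sum, count); a simple approach is to encode that pair as a string.
In the mapper, the count is always 1 and the sum is the raw value itself. To shrink the map output early, you can use a combiner that processes the aggregates as (sum_1 + ... + sum_n, count_1 + ... + count_n). The reducer must repeat this and finish with the final calculation sum / count. Keep in mind that this approach is independent of the key used.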
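The (sum, count) scheme just described can be tried out in plain Java before wiring it into Hadoop. This is a minimal sketch under stated assumptions: the class name, the method names, and the "sum:count" string encoding with ':' as separator are all chosen for illustration. The merge step is what both combiner and reducer would do; only the reducer performs the final division.

```java
import java.util.Arrays;

public class SumCount {
    // Encode a partial aggregate as "sum:count" (separator is an assumption of this sketch).
    static String encode(long sum, long count) {
        return sum + ":" + count;
    }

    // Merge partial aggregates component-wise. Because addition is associative
    // and commutative, this can run any number of times (combiner and reducer).
    static String merge(Iterable<String> parts) {
        long sum = 0;
        long count = 0;
        for (String part : parts) {
            String[] atom = part.split(":");
            sum += Long.parseLong(atom[0]);
            count += Long.parseLong(atom[1]);
        }
        return encode(sum, count);
    }

    // Final step, performed once in the reducer: avg = sum / count.
    static double finish(String aggregate) {
        String[] atom = aggregate.split(":");
        return (double) Long.parseLong(atom[0]) / Long.parseLong(atom[1]);
    }

    public static void main(String[] args) {
        // Mapper output for the values 1, 2, 3, 4 and 10: each has count 1.
        String a = merge(Arrays.asList("1:1", "2:1", "3:1", "4:1")); // combiner, task A
        String b = merge(Arrays.asList("10:1"));                      // combiner, task B
        System.out.println(a + " " + b);                         // prints "10:4 10:1"
        System.out.println(finish(merge(Arrays.asList(a, b))));  // prints 4.0
    }
}
```

However the values are partitioned, the merged pair is the same, so the final average matches the one computed over the whole data set.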
Finally, here is a simple example that uses raw data to compute the "average crime time" in Los Angeles:
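// Driver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {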
    enum Counters {
        DISCARDED_ENTRY
    }

    public static void main(String[] args) throws Exception {
        // propagate the job status as the process exit code
        System.exit(ToolRunner.run(new Driver(), args));
    }

    public int run(String[] args) throws Exception {
        Configuration configuration = getConf();

        Job job = Job.getInstance(configuration);
        job.setJarByClass(Driver.class);

        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setCombinerClass(Combiner.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : -1;
    }
}

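// Mapper.java
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class Mapper extends org.apache.hadoop.mapreduce.Mapper<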
    LongWritable,
    Text,
    LongWritable,
    Text
> {

    @Override
    protected void map(
        LongWritable key,
        Text value,
        org.apache.hadoop.mapreduce.Mapper<
            LongWritable,
            Text,
            LongWritable,
            Text
        >.Context context
    ) throws IOException, InterruptedException {
        // parse the CSV line
        ArrayList<String> values = this.parse(value.toString());

        // validate the parsed values
        if (this.isValid(values)) {

            // fetch the third and the fourth column
            String time = values.get(3);
            String year = values.get(2)
                .substring(values.get(2).length() - 4);

            // convert time to minutes (e.g. 1542 -> 942)
            int minutes = Integer.parseInt(time.substring(0, 2))
                * 60 + Integer.parseInt(time.substring(2, 4));

            // create the aggregate atom (a/n)
            // with a = time in minutes and n = 1
            context.write(
                new LongWritable(Integer.parseInt(year)),
                new Text(Integer.toString(minutes) + ":1")
            );
        } else {
            // invalid line format, so we increment a counter
            context.getCounter(Driver.Counters.DISCARDED_ENTRY)
                .increment(1);
        }
    }

    protected boolean isValid(ArrayList<String> values) {
        return values.size() > 3 
            && values.get(2).length() == 10 
            && values.get(3).length() == 4;
    }

    protected ArrayList<String> parse(String line) {
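        ArrayList<String> values = new ArrayList<>();
        // StringBuilder avoids allocating a new String for every character
        StringBuilder current = new StringBuilder();
        boolean escaping = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);

            if (c == '"') {
                // toggle quoted mode; commas inside quotes are not separators
                escaping = !escaping;
            } else if (c == ',' && !escaping) {
                values.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }

        // add the last field, which has no trailing comma
        values.add(current.toString());

        return values;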
    }
}

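// Combiner.java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class Combiner extends org.apache.hadoop.mapreduce.Reducer<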
    LongWritable,
    Text,
    LongWritable,
    Text
> {

    @Override
    protected void reduce(
        LongWritable key,
        Iterable<Text> values,
        Context context
    ) throws IOException, InterruptedException {
        // primitives avoid boxing on every addition
        long n = 0L; // running count
        long a = 0L; // running sum in minutes
        Iterator<Text> iterator = values.iterator();

        // calculate intermediate aggregates
        while (iterator.hasNext()) {
            String[] atom = iterator.next().toString().split(":");
            a += Long.parseLong(atom[0]);
            n += Long.parseLong(atom[1]);
        }

        context.write(key, new Text(Long.toString(a) + ":" + Long.toString(n)));
    }
}

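// Reducer.java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class Reducer extends org.apache.hadoop.mapreduce.Reducer<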
    LongWritable,
    Text,
    LongWritable,
    Text
> {

    @Override
    protected void reduce(
        LongWritable key, 
        Iterable<Text> values, 
        Context context
    ) throws IOException, InterruptedException {
        long n = 0L; // total count
        long a = 0L; // total sum in minutes
        Iterator<Text> iterator = values.iterator();

        // calculate the final aggregate
        while (iterator.hasNext()) {
            String[] atom = iterator.next().toString().split(":");
            a += Long.parseLong(atom[0]);
            n += Long.parseLong(atom[1]);
        }

        // integer division cuts off the seconds (the fractional minute)
        long average = a / n;

        // convert the average minutes back to a zero-padded HH:mm time
        context.write(
            key,
            new Text(String.format("%02d:%02d", average / 60, average % 60))
        );
    }
}
