Sorting MapReduce output in Java 8

java hadoop java-8

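I am trying to sort the output of my reducer in Hadoop using the solution described in this question:

That solution had some conflicts with Java 8, so I resolved them as follows: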

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.util.Map;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.LinkedHashMap;
import java.util.Collections;
import java.util.List;
import java.util.Comparator;

public class HourlyTweetsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();
    public Map<String , Integer> map = new LinkedHashMap<String , Integer>();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        map.put(key.toString() , sum);

        result.set(sum);
        context.write(key, result);
    }

    public void cleanup(Context context){
        //Cleanup is called once at the end to finish off anything for reducer
        //Here we will write our final output
        Map<String , Integer>  sortedMap = new HashMap<String , Integer>();
        sortedMap = sortMap(map);

        for (Map.Entry<String,Integer> entry : sortedMap.entrySet()){
            context.write(new Text(entry.getKey()),new IntWritable(entry.getValue()));
        }
    }

    public Map<String , Integer > sortMap (Map<String,Integer> unsortMap){

        Map<String ,Integer> hashmap = new HashMap<String,Integer>();
        int count=0;
        List<Map.Entry<String,Integer>> list = new LinkedList<Map.Entry<String,Integer>>(unsortMap.entrySet());
        //Sorting the list we created from unsorted Map
        Collections.sort(list , new Comparator<Map.Entry<String,Integer>>(){
            public int compare (Map.Entry<String , Integer> o1 , Map.Entry<String , Integer> o2 ){
                //sorting in descending order
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        for(Map.Entry<String, Integer> entry : list){
            // only writing top 3 in the sorted map
            // if(count>2)
            // break;
            hashmap.put(entry.getKey(),entry.getValue());
        }

        return hashmap ;
    }

}

How can we fix it?

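I see two options here:

1. Just use a single reducer. This requires that all of the data fits into the memory of one machine. The input of that single reducer will then be sorted by key, which is what you want.

2. Use TotalOrderPartitioner. It enforces an additional stage in the MapReduce pipeline that partitions the elements into sorted buckets.

Here is an example (not mine) that demonstrates how to use TotalOrderPartitioner: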
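
The example referred to above is not reproduced here. Separately from it, the idea behind option 2 can be illustrated in plain Java, without Hadoop: keys are routed to buckets by key range, each bucket is sorted on its own, and the buckets are concatenated in order. This is only a sketch of the principle, not the TotalOrderPartitioner API; the class name, the helper bucketFor, and the split points "08" and "16" are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RangePartitionSketch {

    // Hypothetical helper: index of the bucket a key belongs to, given sorted
    // split points. Keys below splitPoints[0] go to bucket 0; a key equal to
    // a split point goes to the bucket that starts at that split point.
    public static int bucketFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Routes keys to range buckets, sorts each bucket independently (as each
    // reducer would), then concatenates the buckets in bucket order.
    public static List<String> sortWithRangePartitions(List<String> keys, String[] splitPoints) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(bucketFor(key, splitPoints)).add(key);
        }
        List<String> output = new ArrayList<>();
        for (List<String> bucket : buckets) {
            Collections.sort(bucket);
            output.addAll(bucket);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> hours = Arrays.asList("17", "03", "22", "09", "14", "00");
        String[] splitPoints = {"08", "16"}; // assumed split points, three buckets
        System.out.println(sortWithRangePartitions(hours, splitPoints));
        // prints [00, 03, 09, 14, 17, 22]
    }
}
```

Because every key in bucket i is smaller than every key in bucket i + 1, the concatenated output is globally sorted even though no single bucket ever sees all of the data.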
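
I won't judge whether this code needs additional steps to be correct in the context of Hadoop's Map/Reduce, but one obvious mistake is the line at the beginning of sortMap:

Map<String ,Integer> hashmap = new HashMap<String,Integer>();

A HashMap does not preserve insertion order, so copying the sorted list into it discards the ordering again; a LinkedHashMap would keep it. In addition, in cleanup the reference to the newly created map is immediately overwritten by the result of sortMap, so that map instance is completely obsolete. However, since all you do with the sorted map is iterate over it once to perform one action, there is no need to copy the sorted list into a result map at all, because you can perform the action while iterating over the list: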
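// requires: import java.util.ArrayList;
public void cleanup(Context context) throws IOException, InterruptedException {
    // Cleanup is called once at the end to finish off anything for the reducer.
    // Here we write our final output, sorted by value in descending order.
    List<Map.Entry<String,Integer>> list = new ArrayList<>(map.entrySet());
    Collections.sort(list, Map.Entry.comparingByValue(Comparator.reverseOrder()));
    for (Map.Entry<String,Integer> entry : list) {
        // context.write can throw IOException and InterruptedException,
        // which is why the method declares them.
        context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
    }
}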

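Note that this code uses ArrayList rather than LinkedList. For all three operations you perform on the list, 1) initializing it with the contents of the map's entry set, 2) sorting it in place, and 3) iterating over it, an ArrayList can be significantly faster. This holds especially for step 2) in Java 8, where Collections.sort delegates to List.sort and ArrayList sorts its backing array directly instead of copying the elements into a temporary array and back.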
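
If a sorted map is still wanted, for instance to keep a helper like sortMap from the question, the following standalone sketch shows one way to do it so that the order survives. It runs without Hadoop; the class name, the method name sortByValueDescending, and the sample counts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortByValueSketch {

    // Returns a map whose iteration order is by descending value.
    // LinkedHashMap keeps insertion order; a plain HashMap would not.
    public static Map<String, Integer> sortByValueDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.comparingByValue(Comparator.reverseOrder()));

        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical tweet counts per hour.
        Map<String, Integer> tweetsPerHour = new HashMap<>();
        tweetsPerHour.put("09", 120);
        tweetsPerHour.put("14", 300);
        tweetsPerHour.put("22", 45);

        System.out.println(sortByValueDescending(tweetsPerHour));
        // prints {14=300, 09=120, 22=45}
    }
}
```

The only change relative to the approach in the question is the map type used for the result; the comparator expression is the same one used in the cleanup method above.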

HourlyTweets.java uses or overrides a deprecated API.

Is that a library you are using or your own program?

It is my own program, based on Hadoop.

Well, the error message states it plainly: unreported exception IOException; must be caught or declared to be thrown, at HourlyTweetsReducer.java:45. Just use a nice try on the specified line 45. Are you developing in Notepad? A decent IDE would probably even offer you the choice of what to do.

Just to explain option 1: the problem I see is not in the code but in the number of reducers that run. If you run multiple reducers, your data is distributed across multiple buckets, each bucket is sorted on its own, and the buckets are then merged, which causes the problem.

Thanks. That is part of the main job file, not the mapper or the reducer, right?

Well, do it wherever you define the number of reducers. If you have not defined it, the number is decided based on the size of the data. Defining it in the main job file is good practice, but make sure the data fits into the memory of a single machine. If it does not, use option 2; that will definitely work.
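
The effect described in the comments can be reproduced in plain Java, without Hadoop. The sketch below assigns keys to simulated reducers by hash, sorts each reducer's keys, and lists the per-reducer outputs one after another. It assumes the hash-modulo scheme of Hadoop's default HashPartitioner and uses String.hashCode as a stand-in for the hash of a Hadoop key type; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class HashPartitionSketch {

    // Simulates numReducers reducers: keys are assigned by hash, each reducer
    // sorts its own keys, and the per-reducer outputs are concatenated.
    public static List<String> runReducers(List<String> keys, int numReducers) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            // Assumption: String.hashCode stands in for the Hadoop key's hash.
            int partition = (key.hashCode() & Integer.MAX_VALUE) % numReducers;
            partitions.get(partition).add(key);
        }
        List<String> output = new ArrayList<>();
        for (List<String> partition : partitions) {
            Collections.sort(partition);
            output.addAll(partition);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> hours = Arrays.asList("17", "03", "22", "09", "14", "00");
        System.out.println(runReducers(hours, 1)); // [00, 03, 09, 14, 17, 22]
        System.out.println(runReducers(hours, 2)); // [00, 17, 22, 03, 09, 14]
    }
}
```

With one reducer the output is globally sorted; with two, each half is sorted but the whole is not. In a real job driver, the reducer count is set with Job.setNumReduceTasks, for example job.setNumReduceTasks(1) for option 1.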