Java MapReduce secondary sort not working

java, hadoop, mapreduce, secondary-sort

I am trying to do a secondary sort in MapReduce using a composite key consisting of:
- a String natural key = the program name
- a Long sort key = the time in milliseconds since 1970

The problem is that after the sort I get a separate reduce group for every distinct composite key. Through debugging I have verified that the hashCode and compare functions are correct. In the debug log, each block below comes from a different reducer invocation, which suggests that grouping or partitioning is not working. From the debug log:
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=the voice
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:03 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:03 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key the voice ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=top gear
14/12/14 00:55:12 INFO popularitweet.EtanReducer: top gear: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key top gear ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=american horror story
14/12/14 00:55:12 INFO popularitweet.EtanReducer: american horror story: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key american horror story ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=the voice
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key the voice ended
As you can see, "the voice" is sent to two different reducers, with different timestamps.
Any help would be appreciated.
The composite key is the following class:
public class ProgramKey implements WritableComparable<ProgramKey> {
    private String program;
    private Long timestamp;

    public ProgramKey() {
    }

    public ProgramKey(String program, Long timestamp) {
        this.program = program;
        this.timestamp = timestamp;
    }

    public String getProgram() {
        return program;
    }

    public Long getTimestamp() {
        return timestamp;
    }

    @Override
    public int compareTo(ProgramKey o) {
        int result = program.compareTo(o.program);
        if (result == 0) {
            result = timestamp.compareTo(o.timestamp);
        }
        return result;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        WritableUtils.writeString(dataOutput, program);
        dataOutput.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        program = WritableUtils.readString(dataInput);
        timestamp = dataInput.readLong();
    }
}
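For reference, the ProgramKeyPartitioner and ProgramKeyGroupingComparator referenced in the job setup below are not shown in the question. In a typical secondary-sort setup both must use only the natural key (the program name), never the timestamp. A minimal plain-Java sketch of that contract, with the Hadoop base classes stripped out so it runs standalone (class and method names here are assumptions, not the asker's code):

```java
// Plain-Java sketch of the secondary-sort contract: partitioning and
// grouping must use ONLY the natural key (program), never the timestamp.
class SecondarySortSketch {
    // Mirrors Partitioner.getPartition: identical program names always
    // land in the same partition regardless of timestamp.
    static int getPartition(String program, int numPartitions) {
        return (program.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Mirrors the grouping comparator: keys compare equal whenever the
    // program matches, so one reduce() call sees all timestamps for it.
    static int groupCompare(String programA, String programB) {
        return programA.compareTo(programB);
    }

    public static void main(String[] args) {
        int p1 = getPartition("the voice", 10);
        int p2 = getPartition("the voice", 10);
        System.out.println(p1 == p2); // same partition for the same program
        System.out.println(groupCompare("the voice", "the voice") == 0);
    }
}
```

If either of the real classes accidentally consults the timestamp, keys that differ only in time end up at different reducers, which matches the symptom in the logs above.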
EDIT

Your timestamp comparator seems to have a typo... you assign ts2 from a when it should be from b:

ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)a;

when it should be:

ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)b;

This causes the key/value pairs to be sorted incorrectly and invalidates the grouping comparator's assumption that the key/value pairs arrive sorted.
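The effect of the typo can be shown in isolation: with both locals assigned from a, the comparator compares a key against itself and always returns 0, so the sort order handed to the reducers is undefined. A standalone plain-Java demonstration (no Hadoop classes; the Key type is a stand-in for ProgramKey):

```java
// Demonstrates the comparator typo: with both locals assigned from `a`,
// every pair of keys compares as equal (0), so the shuffle sort order
// that the grouping comparator relies on is effectively random.
class ComparatorTypoDemo {
    static final class Key {
        final String program;
        final long timestamp;
        Key(String program, long timestamp) {
            this.program = program;
            this.timestamp = timestamp;
        }
    }

    // Buggy version: ts2 is assigned from a instead of b.
    static int buggyCompare(Key a, Key b) {
        Key ts1 = a;
        Key ts2 = a; // typo: should be b
        int result = ts1.program.compareTo(ts2.program);
        if (result == 0) {
            result = Long.compare(ts1.timestamp, ts2.timestamp);
        }
        return result;
    }

    // Fixed version: compares a against b.
    static int fixedCompare(Key a, Key b) {
        int result = a.program.compareTo(b.program);
        if (result == 0) {
            result = Long.compare(a.timestamp, b.timestamp);
        }
        return result;
    }

    public static void main(String[] args) {
        Key earlier = new Key("the voice", 1418320263000L);
        Key later = new Key("the voice", 1418320264000L);
        System.out.println(buggyCompare(earlier, later)); // always 0
        System.out.println(fixedCompare(earlier, later)); // negative: earlier sorts first
    }
}
```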
Also check whether the original program names are UTF-8, since WritableUtils assumes that. Is your system's default code page UTF-8 as well?

I went through the GroupingComparator, Partitioner, and SortComparator classes, as well as the job code, and they all look correct.

Why not try the following: set a single reducer and see which reduce keys you get. Another test: print out the reduce keys inside the reducer and see whether identical composite keys are handed to different reducers.

I did as you suggested (job.setNumReduceTasks(10)) and printed only the keys. I got:
14/12/14 09:12:51 INFO EtanReducer: new reducer
14/12/14 09:12:51 INFO EtanReducer: key=x factor
14/12/14 09:12:51 INFO EtanReducer: x factor: 1418320302000
14/12/14 09:12:51 INFO EtanReducer: x factor: 1418320302000
14/12/14 09:12:51 INFO EtanReducer: key x factor ended
14/12/14 09:12:51 INFO EtanReducer: new reducer
14/12/14 09:12:51 INFO EtanReducer: key=x factor
14/12/14 09:12:51 INFO EtanReducer: x factor: 1418320302000
14/12/14 09:12:51 INFO EtanReducer: key x factor
In short, the problem appears again. Maybe I am seeing this because I run Hadoop locally? Or should I not be emitting the key by calling context.write(new ProgramKey(program.toString(), DateUtils.textToDate(timeStamp.getTime()), passedTweet))?

Sorry I can't be of more help; I am just thinking of ways to debug this. Maybe you can find some pattern in which keys that differ only in the timestamp end up at different reducers, which should never happen.

Wow, good eye. I went through the whole code and didn't notice it.
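The UTF-8 concern raised above can be checked with a serialization round trip of the composite key. A plain-Java sketch, using writeUTF/readUTF as a stand-in for WritableUtils.writeString/readString (an assumption; both write the string as length-prefixed UTF-8 bytes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Round-trip sanity check for the composite key's serialization: if a
// program name survives serialize -> deserialize unchanged, encoding is
// not the cause of the grouping problem.
class KeyRoundTripCheck {
    static byte[] serialize(String program, long timestamp) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(program);    // program name, UTF-8 encoded
            out.writeLong(timestamp); // millis since 1970
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object[] deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return new Object[] { in.readUTF(), in.readLong() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize("the voice", 1418320263000L);
        Object[] fields = deserialize(data);
        System.out.println(fields[0]); // the voice
        System.out.println(fields[1]); // 1418320263000
    }
}
```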
public class TimeStampComparator extends WritableComparator {
    protected TimeStampComparator() {
        super(ProgramKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        ProgramKey ts1 = (ProgramKey)a;
        ProgramKey ts2 = (ProgramKey)a;
        int result = ts1.getProgram().compareTo(ts2.getProgram());
        if (result == 0) {
            result = ts1.getTimestamp().compareTo(ts2.getTimestamp());
        }
        return result;
    }
}

And the job setup:
public static void main(String[] args) throws IOException,
        InterruptedException, ClassNotFoundException {
    // Create configuration
    Configuration conf = new Configuration();

    // Create job
    Job job = new Job(conf, "test1");
    job.setJarByClass(EtanMapReduce.class);

    // Set partitioner, key comparator and group comparator
    job.setPartitionerClass(ProgramKeyPartitioner.class);
    job.setGroupingComparatorClass(ProgramKeyGroupingComparator.class);
    job.setSortComparatorClass(TimeStampComparator.class);

    // Setup MapReduce
    job.setMapperClass(EtanMapper.class);
    job.setMapOutputKeyClass(ProgramKey.class);
    job.setMapOutputValueClass(TweetObject.class);
    job.setReducerClass(EtanReducer.class);

    // Specify key / value
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(TweetObject.class);

    // Input
    FileInputFormat.addInputPath(job, inputPath);
    job.setInputFormatClass(TextInputFormat.class);

    // Output
    FileOutputFormat.setOutputPath(job, outputDir);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Delete output if exists
    FileSystem hdfs = FileSystem.get(conf);
    if (hdfs.exists(outputDir))
        hdfs.delete(outputDir, true);

    // Execute job
    logger.info("starting job");
    int code = job.waitForCompletion(true) ? 0 : 1;
    System.exit(code);
}