Java MapReduce secondary sort not working

java, hadoop, mapreduce, secondary-sort

I am trying to do a secondary sort in MapReduce using a composite key consisting of:
- a String natural key = the program name
- a Long sort key = the time in milliseconds since 1970

The problem is that after the sort I get a separate reduce group for every distinct composite key. Through debugging I have verified that the hashCode and compare functions are correct. In the debug log, each block below comes from a different reducer invocation, which suggests that grouping or partitioning is not working. From the debug log:
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=the voice
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:03 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:03 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key the voice ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=top gear
14/12/14 00:55:12 INFO popularitweet.EtanReducer: top gear: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key top gear ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=american horror story
14/12/14 00:55:12 INFO popularitweet.EtanReducer: american horror story: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key american horror story ended
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key=the voice
14/12/14 00:55:12 INFO popularitweet.EtanReducer: the voice: Thu Dec 11 17:51:04 +0000 2014
14/12/14 00:55:12 INFO popularitweet.EtanReducer: key the voice ended
As you can see, "the voice" is sent to two different reducers, with different timestamps.
Any help would be appreciated.
The composite key is the following class:
public class ProgramKey implements WritableComparable<ProgramKey> {
    private String program;
    private Long timestamp;

    public ProgramKey() {
    }

    public ProgramKey(String program, Long timestamp) {
        this.program = program;
        this.timestamp = timestamp;
    }

    public String getProgram() {
        return program;
    }

    public Long getTimestamp() {
        return timestamp;
    }

    @Override
    public int compareTo(ProgramKey o) {
        int result = program.compareTo(o.program);
        if (result == 0) {
            result = timestamp.compareTo(o.timestamp);
        }
        return result;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        WritableUtils.writeString(dataOutput, program);
        dataOutput.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        program = WritableUtils.readString(dataInput);
        timestamp = dataInput.readLong();
    }
}
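For reference, the ProgramKeyPartitioner and ProgramKeyGroupingComparator referenced in the job setup below are not shown in the question. In a typical secondary-sort setup both must use only the natural key (the program name), never the timestamp. A minimal plain-Java sketch of that contract, with the Hadoop base classes stripped out so it runs standalone (class and method names here are assumptions, not the asker's code):

```java
// Plain-Java sketch of the secondary-sort contract: partitioning and
// grouping must use ONLY the natural key (program), never the timestamp.
class SecondarySortSketch {
    // Mirrors Partitioner.getPartition: identical program names always
    // land in the same partition regardless of timestamp.
    static int getPartition(String program, int numPartitions) {
        return (program.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Mirrors the grouping comparator: keys compare equal whenever the
    // program matches, so one reduce() call sees all timestamps for it.
    static int groupCompare(String programA, String programB) {
        return programA.compareTo(programB);
    }

    public static void main(String[] args) {
        int p1 = getPartition("the voice", 10);
        int p2 = getPartition("the voice", 10);
        System.out.println(p1 == p2); // same partition for the same program
        System.out.println(groupCompare("the voice", "the voice") == 0);
    }
}
```

If either of the real classes accidentally consults the timestamp, keys that differ only in time end up at different reducers, which matches the symptom in the logs above.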
EDIT

Your timestamp comparator seems to have a typo... you assign ts2 from a when it should be from b:

ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)a;

when it should be:

ProgramKey ts1 = (ProgramKey)a;
ProgramKey ts2 = (ProgramKey)b;

This causes the key/value pairs to be sorted incorrectly and invalidates the grouping comparator's assumption that the key/value pairs arrive sorted.
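The effect of the typo can be shown in isolation: with both locals assigned from a, the comparator compares a key against itself and always returns 0, so the sort order handed to the reducers is undefined. A standalone plain-Java demonstration (no Hadoop classes; the Key type is a stand-in for ProgramKey):

```java
// Demonstrates the comparator typo: with both locals assigned from `a`,
// every pair of keys compares as equal (0), so the shuffle sort order
// that the grouping comparator relies on is effectively random.
class ComparatorTypoDemo {
    static final class Key {
        final String program;
        final long timestamp;
        Key(String program, long timestamp) {
            this.program = program;
            this.timestamp = timestamp;
        }
    }

    // Buggy version: ts2 is assigned from a instead of b.
    static int buggyCompare(Key a, Key b) {
        Key ts1 = a;
        Key ts2 = a; // typo: should be b
        int result = ts1.program.compareTo(ts2.program);
        if (result == 0) {
            result = Long.compare(ts1.timestamp, ts2.timestamp);
        }
        return result;
    }

    // Fixed version: compares a against b.
    static int fixedCompare(Key a, Key b) {
        int result = a.program.compareTo(b.program);
        if (result == 0) {
            result = Long.compare(a.timestamp, b.timestamp);
        }
        return result;
    }

    public static void main(String[] args) {
        Key earlier = new Key("the voice", 1418320263000L);
        Key later = new Key("the voice", 1418320264000L);
        System.out.println(buggyCompare(earlier, later)); // always 0
        System.out.println(fixedCompare(earlier, later)); // negative: earlier sorts first
    }
}
```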
Also check whether the original program names are UTF-8, since WritableUtils assumes that. Is your system's default code page UTF-8 as well?

I went through the GroupingComparator, Partitioner, and SortComparator classes, as well as the job code, and they all look correct.

Why not try the following: set a single reducer and see which reduce keys you get. Another test: print out the reduce keys inside the reducer and see whether identical composite keys are handed to different reducers.

I did as you suggested (job.setNumReduceTasks(10)) and printed only the keys. I got:
14/12/14 09:12:51 INFO EtanReducer: new reducer
14/12/14 09:12:51 INFO EtanReducer: key=x factor
14/12/14 09:12:51 INFO EtanReducer: x factor: 1418320302000
14/12/14 09:12:51 INFO EtanReducer: x factor: 1418320302000
14/12/14 09:12:51 INFO EtanReducer: key x factor ended
14/12/14 09:12:51 INFO EtanReducer: new reducer
14/12/14 09:12:51 INFO EtanReducer: key=x factor
14/12/14 09:12:51 INFO EtanReducer: x factor: 1418320302000
14/12/14 09:12:51 INFO EtanReducer: key x factor
In short, the problem appears again. Maybe I am seeing this because I run Hadoop locally? Or should I not be emitting the key by calling context.write(new ProgramKey(program.toString(), DateUtils.textToDate(timeStamp.getTime()), passedTweet))?

Sorry I can't be of more help; I am just thinking of ways to debug this. Maybe you can find some pattern in which keys that differ only in the timestamp end up at different reducers, which should never happen.

Wow, good eye. I went through the whole code and didn't notice it.
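The UTF-8 concern raised above can be checked with a serialization round trip of the composite key. A plain-Java sketch, using writeUTF/readUTF as a stand-in for WritableUtils.writeString/readString (an assumption; both write the string as length-prefixed UTF-8 bytes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Round-trip sanity check for the composite key's serialization: if a
// program name survives serialize -> deserialize unchanged, encoding is
// not the cause of the grouping problem.
class KeyRoundTripCheck {
    static byte[] serialize(String program, long timestamp) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(program);    // program name, UTF-8 encoded
            out.writeLong(timestamp); // millis since 1970
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object[] deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return new Object[] { in.readUTF(), in.readLong() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize("the voice", 1418320263000L);
        Object[] fields = deserialize(data);
        System.out.println(fields[0]); // the voice
        System.out.println(fields[1]); // 1418320263000
    }
}
```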
public class TimeStampComparator extends WritableComparator {
    protected TimeStampComparator() {
        super(ProgramKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        ProgramKey ts1 = (ProgramKey)a;
        ProgramKey ts2 = (ProgramKey)a;
        int result = ts1.getProgram().compareTo(ts2.getProgram());
        if (result == 0) {
            result = ts1.getTimestamp().compareTo(ts2.getTimestamp());
        }
        return result;
    }
}

And the job setup:
public static void main(String[] args) throws IOException,
        InterruptedException, ClassNotFoundException {
    // Create configuration
    Configuration conf = new Configuration();

    // Create job
    Job job = new Job(conf, "test1");
    job.setJarByClass(EtanMapReduce.class);

    // Set partitioner, key comparator and group comparator
    job.setPartitionerClass(ProgramKeyPartitioner.class);
    job.setGroupingComparatorClass(ProgramKeyGroupingComparator.class);
    job.setSortComparatorClass(TimeStampComparator.class);

    // Setup MapReduce
    job.setMapperClass(EtanMapper.class);
    job.setMapOutputKeyClass(ProgramKey.class);
    job.setMapOutputValueClass(TweetObject.class);
    job.setReducerClass(EtanReducer.class);

    // Specify key / value
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(TweetObject.class);

    // Input
    FileInputFormat.addInputPath(job, inputPath);
    job.setInputFormatClass(TextInputFormat.class);

    // Output
    FileOutputFormat.setOutputPath(job, outputDir);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Delete output if exists
    FileSystem hdfs = FileSystem.get(conf);
    if (hdfs.exists(outputDir))
        hdfs.delete(outputDir, true);

    // Execute job
    logger.info("starting job");
    int code = job.waitForCompletion(true) ? 0 : 1;
    System.exit(code);
}