Java: splitting reduced data in Hadoop into output and new input
I have been searching around for days, trying to find a way to use reduced data for further mapping in Hadoop. I take objects of class A as input data and produce objects of class B as output data. The problem is that mapping generates not only Bs, but also new As.

Here is what I want to achieve:
1.1 input: a list of As
1.2 map result: for each A a list of new As and a list of Bs is generated
1.3 reduce: filtered Bs are saved as output, filtered As are added to the map jobs
2.1 input: a list of As produced by the first map/reduce
2.2 map result: for each A a list of new As and a list of Bs is generated
2.3 ...
3.1 ...
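Stripped of Hadoop specifics, the iteration above is a driver loop: each round turns the current As into Bs (collected as final output) and new As (fed back in as the next round's input), and the loop stops once no As remain. A minimal plain-Java sketch of that control flow, with toy A/B types and a made-up expansion rule (everything here is illustrative, not actual job code):

```java
import java.util.ArrayList;
import java.util.List;

public class IterativeSplit {
    // Toy stand-in for class A: just carries an iteration counter.
    record A(int level) {}
    // Toy stand-in for class B: the finished output records.
    record B(String payload) {}

    static List<B> run(List<A> initialAs) {
        List<A> currentAs = initialAs;          // 1.1: input is a list of As
        List<B> allBs = new ArrayList<>();
        while (!currentAs.isEmpty()) {          // one pass = one map/reduce round
            List<A> nextAs = new ArrayList<>();
            for (A a : currentAs) {
                // 1.2: each A yields new As and Bs (made-up rule: stop at level 2)
                if (a.level() < 2) {
                    nextAs.add(new A(a.level() + 1));
                    nextAs.add(new A(a.level() + 1));
                }
                allBs.add(new B("b@" + a.level()));
            }
            currentAs = nextAs;                 // 1.3: filtered As become the new input
        }
        return allBs;                           // 1.3: Bs are the saved output
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(new A(0))).size()); // 1 + 2 + 4 = 7 Bs
    }
}
```

In real Hadoop terms, each pass of the `while` loop would be one `job.waitForCompletion(true)` call, with the output directory of round N wired up as the input directory of round N+1.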
You should get the basic idea.

I have read a lot about chaining, but I cannot figure out how to combine ChainReducer and ChainMapper, or even whether that is the right approach.

So my question is: how do I split the data during the reduce phase so that one part is saved as output and the other part becomes new input data?

Try using MultipleOutputs. As the Javadoc suggests:
The MultipleOutputs class simplifies writing output data to multiple outputs.

Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.

Case two: to write data to different files provided by the user.

Usage pattern for job submission:
Job job = new Job();

FileInputFormat.setInputPath(job, inDir);
FileOutputFormat.setOutputPath(job, outDir);

job.setMapperClass(MOMap.class);
job.setReducerClass(MOReduce.class);
...

// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
    LongWritable.class, Text.class);

// Defines additional sequence-file based output 'sequence' for the job
MultipleOutputs.addNamedOutput(job, "seq",
    SequenceFileOutputFormat.class,
    LongWritable.class, Text.class);
...

job.waitForCompletion(true);
...

Usage in the Reducer:

String generateFileName(K k, V v) {
    return k.toString() + "_" + v.toString();
}

public class MOReduce extends
    Reducer<WritableComparable, Writable, WritableComparable, Writable> {

    private MultipleOutputs mos;

    public void setup(Context context) {
        ...
        mos = new MultipleOutputs(context);
    }

    public void reduce(WritableComparable key, Iterable<Writable> values,
            Context context) throws IOException, InterruptedException {
        ...
        mos.write("text", key, new Text("Hello"));
        mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
        mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
        mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
        ...
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
        ...
    }
}
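The piece that answers step 1.3 is the split itself: one named output (say `"newA"`) collects the As to be re-processed, while the default output collects the finished Bs, and the next job's input path is then pointed at the files that named output produced. The routing is just a keyed fan-out; a Hadoop-free toy illustration (the bucket names and the A-vs-B rule here are made up):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NamedOutputsSketch {
    // Toy stand-in for MultipleOutputs: routes each record into a named bucket,
    // the way mos.write("newA", ...) vs. the default context.write(...) would
    // split records across output files.
    static Map<String, List<String>> route(List<String> records) {
        Map<String, List<String>> outputs = new HashMap<>();
        for (String r : records) {
            // made-up rule: records starting with "A" are new input, the rest are results
            String name = r.startsWith("A") ? "newA" : "part";
            outputs.computeIfAbsent(name, k -> new ArrayList<>()).add(r);
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> out = route(List.of("A1", "B1", "A2", "B2"));
        // "newA" plays the role of the named output that the next job reads back in
        System.out.println(out.get("newA")); // prints [A1, A2]
    }
}
```

In an actual job, the `"newA"` bucket corresponds to output files named after the named output in the job's output directory, so the follow-up job can select just those files as its input while the default `part-r-*` files hold the final Bs.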
Note that these code examples are for Hadoop 0.* rather than 1.0.4; the interface changed slightly in 1.0.4, which I am using. But this is exactly the basic idea I was looking for. Thank you very much.

That's right, it is for 0.20.