Running dependent Hadoop jobs in one driver (Java)

Tags: java, hadoop, distributed-cache

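I currently have two Hadoop jobs, where the second job requires the output of the first to be added to the distributed cache. At the moment I run them manually: once the first job has finished, I pass its output file as an argument to the second job, whose driver adds it to the cache.

The first job is just a simple map-only job, and I would like to be able to run a single command that executes both jobs in sequence.

Can anyone help me with the code to get the output of the first job into the distributed cache so that it can be passed on to the second job?

Thanks

EDIT: This is the current driver for job 1: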

public class PlaceDriver {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: PlaceMapper <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Place Mapper");
        job.setJarByClass(PlaceDriver.class);
        job.setMapperClass(PlaceMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This is the driver for job 2. The output of job 1 is passed to job 2 as the first argument and loaded into the cache:

public class LocalityDriver {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: LocalityDriver <cache> <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Job Name Here");
        DistributedCache.addCacheFile(new Path(otherArgs[0]).toUri(), job.getConfiguration());
        job.setNumReduceTasks(1); //TODO: Will change
        job.setJarByClass(LocalityDriver.class);
        job.setMapperClass(LocalityMapper.class);
        job.setCombinerClass(TopReducer.class);
        job.setReducerClass(TopReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[1]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

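A simple answer is to extract the code of the two main methods into two separate methods, e.g. boolean job1() and boolean job2(), and call them one after the other in a main method, like so:

public static void main(String[] args) throws Exception {
   if (job1()) {
      job2();
   }
}

where the return values of the job1 and job2 calls are the results of calling job.waitForCompletion(true).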
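
As a rough sketch of this pattern, the chaining logic can be written without any Hadoop classes. The names Step and runChain are hypothetical and not from the answer above, and the lambdas only stand in for job1() and job2(); in a real driver each step would return job.waitForCompletion(true).

```java
import java.util.ArrayList;
import java.util.List;

public class ChainSketch {

    /** Hypothetical stand-in for a method such as job1() that returns job.waitForCompletion(true). */
    interface Step {
        boolean run() throws Exception;
    }

    /** Runs the steps in order, stops at the first failure, and returns a process exit code. */
    static int runChain(List<Step> steps) throws Exception {
        for (Step step : steps) {
            if (!step.run()) {
                return 1; // a step failed: skip everything that depends on it
            }
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();
        List<Step> steps = new ArrayList<>();
        // Placeholders for the real jobs; each records that it ran and reports success.
        steps.add(() -> { log.add("job1"); return true; });
        steps.add(() -> { log.add("job2"); return true; });
        int exitCode = runChain(steps);
        System.out.println(log + " exit=" + exitCode);
    }
}
```

Returning an exit code instead of calling System.exit inside each job method keeps the second job from starting when the first one fails, and lets main decide how to report the result.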

Job chaining in MapReduce is a very common scenario. You could try an open-source MapReduce workflow management tool. There is also some discussion going on about Cascading. Alternatively, you can look at existing discussions of questions similar to yours.

Create two Job objects in the same main. Let the first one wait for completion, then run the other.

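import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Map and Reduce are your own mapper and reducer classes.
public class DefaultTest extends Configured implements Tool {

    public int run(String[] args) throws Exception {

        // Use the configuration that ToolRunner has already populated.
        Job job = new Job(getConf());

        job.setJobName("DefaultTest-blockx15");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setNumReduceTasks(15);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(DefaultTest.class);

        // Block until the first job finishes; do not start the second one if it failed.
        if (!job.waitForCompletion(true)) {
            return 1;
        }

        Job job2 = new Job(getConf());

        // define your second job with the input path defined as the output of the previous job.

        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options itself and passes the remaining args to run().
        System.exit(ToolRunner.run(new Configuration(), new DefaultTest(), args));
    }
}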

You can also use ChainMapper, JobControl and ControlledJob to control your job flow.

Configuration config = getConf();

Job j1 = new Job(config);
Job j2 = new Job(config);
Job j3 = new Job(config);

// Run j1 on its own and wait for it to finish.
j1.waitForCompletion(true);

// Let JobControl run j2 and then j3 (assuming here that j3 depends on j2).
JobControl jobFlow = new JobControl("j2-j3");
ControlledJob cj2 = new ControlledJob(j2, null);
jobFlow.addJob(cj2);
jobFlow.addJob(new ControlledJob(j3, Lists.newArrayList(cj2)));
// run() keeps polling until stop() is called, so it is often started in its own thread.
jobFlow.run();

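Comments:

You could start by writing the code that calls both jobs from here, and then people can help you modify it.

Then how would I load the output of job1 into the distributed cache of job2? As far as I can tell, new Path(otherArgs[1]) is correctly set as the output of the first job and the input of the second.

In other words, write the output of the first job to a temporary directory?

I would not recommend that. You can write it to a normal directory and delete it manually at the end, once the second job has finished.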
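
On the question of getting job 1's output into job 2's distributed cache: one possible approach is to list job 1's output directory once it has finished and add each output file to the cache of the second job. The sketch below assumes the same DistributedCache API used in the question; the class CacheChaining and the methods isPartFile and addOutputToCache are hypothetical names, not part of Hadoop or of the code above.

```java
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical helper, not part of the original question or answers.
public final class CacheChaining {

    private CacheChaining() {
    }

    // Job output files are named part-m-NNNNN or part-r-NNNNN; this skips
    // markers such as _SUCCESS that may sit in the same directory.
    static boolean isPartFile(String name) {
        return name.startsWith("part-");
    }

    // Adds every part file under outputDir to the distributed cache of job
    // and returns how many files were added. Call this before submitting job.
    static int addOutputToCache(Path outputDir, Job job) throws IOException {
        FileSystem fs = outputDir.getFileSystem(job.getConfiguration());
        int added = 0;
        for (FileStatus status : fs.listStatus(outputDir)) {
            if (isPartFile(status.getPath().getName())) {
                DistributedCache.addCacheFile(status.getPath().toUri(), job.getConfiguration());
                added++;
            }
        }
        return added;
    }
}
```

In a combined driver this would be called as CacheChaining.addOutputToCache(new Path(otherArgs[1]), job2), after job1.waitForCompletion(true) has returned true and before job2.waitForCompletion(true) is called. The second job's mapper can then read the cached files exactly as it does today.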