NullPointerException in hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex when chaining two jobs

java hadoop mapreduce

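I am trying to build an inverted index.

I chain two jobs.

Basically, the first job parses and cleans the input and stores the result in the "output" folder, which is the input folder for the second job.

The second job actually builds the inverted index.

When I had only the first job, it worked fine (at least, there were no exceptions).

I chain the two jobs like this: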

public class Main {

    public static void main(String[] args) throws Exception {

        String inputPath = args[0];
        String outputPath = args[1];
        String stopWordsPath = args[2];
        String finalOutputPath = args[3];

        Configuration conf = new Configuration();    
        conf.set("job.stopwords.path", stopWordsPath);

        Job job = Job.getInstance(conf, "Tokenize");

        job.setJobName("Tokenize");
        job.setJarByClass(TokenizerMapper.class);

        job.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(PostingListEntry.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PostingListEntry.class);

        job.setOutputFormatClass(MapFileOutputFormat.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(TokenizerReducer.class);

        // Delete the output directory if it exists already.
        Path outputDir = new Path(outputPath);
        FileSystem.get(conf).delete(outputDir, true);

        long startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");

        //-------------------------------------------------------------------------

        Configuration conf2 = new Configuration();    

        Job job2 = Job.getInstance(conf2, "BuildIndex");

        job2.setJobName("BuildIndex");
        job2.setJarByClass(InvertedIndexMapper.class);

        job2.setOutputFormatClass(TextOutputFormat.class);

        job2.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(job2, new Path(outputPath));
        FileOutputFormat.setOutputPath(job2, new Path(finalOutputPath));

        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(PostingListEntry.class);

        job2.setMapperClass(InvertedIndexMapper.class);
        job2.setReducerClass(InvertedIndexReducer.class);

        // Delete the output directory if it exists already.
        Path finalOutputDir = new Path(finalOutputPath);
        FileSystem.get(conf2).delete(finalOutputDir, true);

        startTime = System.currentTimeMillis();
        // THIS LINE GIVES ERROR: 
        job2.waitForCompletion(true);
        System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
    }
}

I get the following exception:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at Main.main(Main.java:79)

What is wrong with this configuration, and how should I chain the jobs?

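It is unclear whether you intentionally used MapFileOutputFormat as the output format of the first job. The more common approach is to use SequenceFileOutputFormat for the first job's output, with SequenceFileInputFormat as the input format of the second job.

At the moment, you have specified MapFileOutputFormat as the output format of the first job and no input format for the second job, so the second job defaults to TextInputFormat, which is unlikely to work.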
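As a rough illustration of the SequenceFile pairing described above, the wiring could look like the sketch below. The class name JobChainingSketch and its two helper methods are hypothetical and only group the relevant calls; mapper, reducer, and key/value classes are assumed to be set on both jobs elsewhere, as in the question's Main class.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical helper, not part of Hadoop: it only groups the format and
// path settings that connect the first job's output to the second job's input.
public final class JobChainingSketch {

    private JobChainingSketch() {
    }

    // Points both jobs at the same intermediate directory and makes the
    // writer and the reader agree on the SequenceFile format.
    public static void linkWithSequenceFiles(Job first, Job second, Path intermediate)
            throws IOException {
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(first, intermediate);

        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(second, intermediate);
    }

    // Runs the second job only if the first one succeeded, so the second
    // job never reads a missing or partial intermediate directory.
    public static boolean runChained(Job first, Job second)
            throws IOException, InterruptedException, ClassNotFoundException {
        return first.waitForCompletion(true) && second.waitForCompletion(true);
    }
}
```

With this pairing, the second job's mapper would receive the first job's output types as its input key and value (Text and PostingListEntry in the question), rather than the byte offset and line of text that TextInputFormat supplies.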

Looking at your TokenizerReducer class, the signature of the reduce method is incorrect. You have:

    public void reduce(Text key, Iterator<PostingListEntry> values, Context context)

It should be:

    public void reduce(Text key, Iterable<PostingListEntry> values, Context context)

Because of this, the framework never calls your implementation, so the job just runs the default identity reduce.
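The effect can be reproduced without Hadoop. In the minimal sketch below, BaseReducer is a hypothetical stand-in for the framework's Reducer class (it is not the Hadoop API): its run method only ever calls the Iterable version of reduce, so a subclass method that takes an Iterator is a separate overload and is silently ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceSignatureDemo {

    // Hypothetical stand-in for the framework's reducer base class.
    public static class BaseReducer {
        public final List<String> out = new ArrayList<>();

        // Default behaviour: pass every value through unchanged (identity).
        public void reduce(String key, Iterable<Integer> values) {
            for (Integer v : values) {
                out.add(key + "=" + v);
            }
        }

        // The "framework" side: always calls the Iterable signature.
        public void run(String key, List<Integer> values) {
            reduce(key, values);
        }
    }

    // Iterator parameter: an overload, not an override, so run() skips it.
    public static class WrongReducer extends BaseReducer {
        public void reduce(String key, Iterator<Integer> values) {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            out.add(key + "=" + sum);
        }
    }

    // Iterable parameter plus @Override: the compiler verifies the match.
    public static class RightReducer extends BaseReducer {
        @Override
        public void reduce(String key, Iterable<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            out.add(key + "=" + sum);
        }
    }

    // Feeds the same key and values through the given reducer.
    public static List<String> runWith(BaseReducer reducer) {
        reducer.run("a", Arrays.asList(1, 2));
        return reducer.out;
    }

    public static void main(String[] args) {
        System.out.println(runWith(new WrongReducer())); // prints [a=1, a=2]
        System.out.println(runWith(new RightReducer())); // prints [a=3]
    }
}
```

Adding @Override to the Iterator version would turn the silent fallback into a compile-time error, which is a cheap way to catch this kind of mismatch in a real reducer as well.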

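It is hard to say exactly what is going wrong. Could you post or upload the whole source code somewhere (GitHub?) so that the problem can be reproduced?

Please take a look. Now I get an error even when running only the first job (the one that uses TokenizerMapper and TokenizerReducer). I think there may be a problem with the way I use the ArrayListWritable class, which I took from elsewhere. I would really appreciate any help!

Thanks for posting the source code! Could you tell me which command-line arguments (their exact values) should be used when running the program?

First command-line argument:

/Users/osopova/Documents/00_KSU_Masters/00_2016_Fall/01_Information_Retrieval/02_prog_1/vectorsecretrievalsystem/data/cranfield.txt

Second command-line argument:

output

Third command-line argument:

/Users/osopova/Documents/00_KSU_Masters/00_2016_Fall/01_Information_Retrieval/02_prog_1/vectorspacteretrievalsystem/stopwords/stopwords_small_list.txt

Fourth command-line argument:

finaloutput

That is how it is set up on my laptop! I used MapFileOutputFormat because I was following an approach I had found elsewhere.

@Oleksandra I have added an update for you after a quick look at the code.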