Getting the partition id of an input file in Hadoop


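I need to know the row index of the partition of the input file I am working with. I could force this in the original file by concatenating the row index to the data, but I would rather do it in Hadoop. I have this in my mapper: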

String id = context.getConfiguration().get("mapreduce.task.partition");

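But "id" is 0 in every case. "Hadoop: The Definitive Guide" mentions that properties such as the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". As far as I can tell, it does not actually go into how to access this information.

I went through the documentation for the Context object, and it looks like the call above is the way to do it; the code does compile. But since every value I get is 0, I am not sure I am using the right thing, and I have not been able to find any details online that would help me figure this out.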

Code used for testing:

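import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Test {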

public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String id = context.getConfiguration().get("mapreduce.task.partition");
        context.write(new Text("Test"), new Text(id + "_" + value.toString()));
    }
}


public static class TestReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

        for(Text value : values) {
            context.write(key, value);
        }
    }
}


public static void main(String[] args) throws Exception {

    if(args.length != 2) {
        System.err.println("Usage: Test <input path> <output path>");
        System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(Test.class);
    job.setJobName("Test");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(TestMapper.class);
    job.setReducerClass(TestReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}


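There are two options:

1. Use the offset instead of the line number
2. Keep track of the line number in the mapper

For the first one, the key, which is a LongWritable, tells you the offset of the line being processed. Unless your lines are all exactly the same length, you cannot compute the line number from an offset, but it does let you determine ordering, if that is useful.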
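For the special case where every line really does have the same length, the offset can be turned into a line number with a single division. The following is a minimal standalone sketch, not Hadoop API: the class name OffsetLines, the method lineNumber, and the 6-byte record length are all assumptions made for illustration. In a mapper, the offset argument would be the value of the LongWritable key.

```java
// Standalone sketch (no Hadoop dependency): converts a byte offset into a
// zero-based line number, ASSUMING every line occupies the same number of bytes.
final class OffsetLines {

    // offset:       byte offset of the start of the line (in a mapper this
    //               would come from key.get() on the LongWritable key).
    // bytesPerLine: ASSUMED fixed record length, including the line terminator.
    static long lineNumber(long offset, int bytesPerLine) {
        if (bytesPerLine <= 0) {
            throw new IllegalArgumentException("bytesPerLine must be positive");
        }
        if (offset < 0 || offset % bytesPerLine != 0) {
            throw new IllegalArgumentException(
                    "offset is not on a line boundary: " + offset);
        }
        return offset / bytesPerLine; // zero-based line index
    }

    public static void main(String[] args) {
        // Hypothetical input whose lines are all 6 bytes long.
        for (long offset = 0; offset <= 12; offset += 6) {
            System.out.println(offset + " -> line " + lineNumber(offset, 6));
        }
    }
}
```

The boundary check is there so that an input with a stray longer or shorter line fails loudly instead of silently producing wrong line numbers.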

The second option is to just keep track of it in the mapper. You could change your code to something like this:

public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {

    private long currentLineNum = 0;
    private Text test = new Text("Test");   

    public void map(LongWritable key, Text value, Context context) 
                          throws IOException, InterruptedException {

        context.write(test, new Text(currentLineNum + "_" + value));
        currentLineNum++; 
    }
}
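Note that a counter like currentLineNum belongs to one mapper instance, so it only matches the file's line numbers when a single mapper reads the whole file in order. The plain-Java simulation below illustrates this under stated assumptions: SplitLocalCounter and runMapper are hypothetical stand-ins for a mapper processing one split, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone simulation (no Hadoop dependency) of a per-mapper line counter.
final class SplitLocalCounter {

    // Stand-in for one mapper instance: it receives the lines of a single
    // split, in order, and numbers them with its own counter.
    static List<String> runMapper(List<String> splitLines) {
        long currentLineNum = 0; // starts at 0 for every mapper instance
        List<String> out = new ArrayList<>();
        for (String line : splitLines) {
            out.add(currentLineNum + "_" + line);
            currentLineNum++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> whole = Arrays.asList("a", "b", "c", "d");

        // Unsplit input: the counter matches the position in the file.
        System.out.println(runMapper(whole));

        // Same input cut into two splits: the second mapper restarts at 0,
        // so its numbers no longer match the position in the file.
        System.out.println(runMapper(whole.subList(0, 2)));
        System.out.println(runMapper(whole.subList(2, 4)));
    }
}
```

This is why the approach depends on the input not being split across several mappers.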


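You could also represent your matrix as lines of tuples and include the row and column on every tuple, so that you already have that information when you read the file. If you use a file that is just space- or comma-separated values making up a 2D array, it will be very hard to figure out which line (row) you are currently working on in the mapper.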
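As a rough illustration of the tuple layout, the sketch below parses one line of the form row,col,value. The comma-separated layout and the MatrixEntry class are assumptions made for this example; the answer does not prescribe a concrete format. In a mapper, the input string would be value.toString().

```java
// Standalone sketch (no Hadoop dependency): one matrix cell per input line,
// written as "row,col,value" so the position travels with the data.
final class MatrixEntry {
    final int row;
    final int col;
    final double value;

    MatrixEntry(int row, int col, double value) {
        this.row = row;
        this.col = col;
        this.value = value;
    }

    // Parses one line in the ASSUMED "row,col,value" layout.
    static MatrixEntry parse(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected row,col,value but got: " + line);
        }
        return new MatrixEntry(
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim()),
                Double.parseDouble(parts[2].trim()));
    }

    public static void main(String[] args) {
        MatrixEntry e = parse("2,0,7.5");
        System.out.println("row=" + e.row + " col=" + e.col + " value=" + e.value);
    }
}
```

With this layout the row index no longer depends on line order or on how the input is split, at the cost of a larger input file.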

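I am not clear what "row index of the partition of the input file" actually means. Can you clarify?

@BinaryNerd I could be wrong, but I think it should be the row id of the input file. Say there are 100 lines in the file; I want to know which line the mapper is currently processing (so a number between 0-99 or 1-100).

Hmm, I can use the offset. I see that it is a multiple of 6, and yes, the lines should all be the same length. For what you provided above, how does it know what the current line is? My assumption was that the mappers work concurrently, so the counter above might not have the correct line number. For example, if the first mapper to finish is the one for line 5, wouldn't its current line number be 1?

Each instance of a mapper processes the lines in its file/split sequentially. If there are multiple mappers running, each one works on its own split. There is no concurrent access to the file, so you can track the line with the simple approach above. You need to make sure your input is not split, so use something like gz compression.

Got it. Thank you very much. I thought it was concurrent, so the variable would be off, but I just tested it on a large data set and it works exactly as you described. Thanks for the help.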