Java mapper unexpectedly executes multiple times when it was meant to run once


java hadoop mapreduce


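I am trying to write a very simple job that uses only one mapper and no reducer to write some data to HBase. In the mapper I simply open a connection to HBase, write a few rows to a table, and close the connection. In the job driver I call JobConf.setNumMapTasks(1); and JobConf.setNumReduceTasks(0); to specify that only one mapper and no reducer should run. I also set the reducer class to IdentityReducer in the jobConf. The strange behavior I observe is that the job writes the data to the HBase table successfully, but afterwards the logs show it repeatedly opening a connection to HBase and closing it again. This goes on for 20-30 minutes and stops only after the job is declared 100% complete and successful. Finally, when I check the _success file created from the dummy data I put into OutputCollector.collect(...), I see hundreds of lines of dummy data where there should be only one. Here is the code for the job driver: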

    public int run(String[] arg0) throws Exception {
        Configuration config = HBaseConfiguration.create(getConf());
        ensureRequiredParametersExist(config);
        ensureOptionalParametersExist(config);

        JobConf jobConf = new JobConf(config, getClass());
        jobConf.setJobName(config.get(ETLJobConstants.ETL_JOB_NAME));
        //set map specific configuration
        jobConf.setNumMapTasks(1);
        jobConf.setMaxMapAttempts(1);
        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setMapperClass(SingletonMapper.class);
        jobConf.setMapOutputKeyClass(LongWritable.class);
        jobConf.setMapOutputValueClass(Text.class);

        //set reducer specific configuration
        jobConf.setReducerClass(IdentityReducer.class);
        jobConf.setOutputKeyClass(LongWritable.class);
        jobConf.setOutputValueClass(Text.class);
        jobConf.setOutputFormat(TextOutputFormat.class);
        jobConf.setNumReduceTasks(0);

        //set job specific configuration details like input file name etc
        FileInputFormat.setInputPaths(jobConf, jobConf.get(ETLJobConstants.ETL_JOB_FILE_INPUT_PATH));
        System.out.println("setting output path to : " + jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH));
        FileOutputFormat.setOutputPath(jobConf,
                new Path(jobConf.get(ETLJobConstants.ETL_JOB_FILE_OUTPUT_PATH)));
        JobClient.runJob(jobConf);
        return 0;
    }

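The driver class extends Configured and implements Tool (I followed the example in the Definitive Guide). Below is the code from my mapper's map method: I simply open a connection to HBase, do some preliminary checks to make sure the table exists, then write the rows and close the table.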

    public void map(LongWritable arg0, Text arg1,
            OutputCollector<LongWritable, Text> arg2, Reporter arg3)
            throws IOException {

        HTable aTable = null;
        HBaseAdmin admin = null;

        try {
            arg3.setStatus("started");

            /*
             * set-up hbase config
             */
            admin = new HBaseAdmin(conf);

            /*
             * open connection to table
             */
            String tableName = conf.get(ETLJobConstants.ETL_JOB_TABLE_NAME);

            HTableDescriptor htd = new HTableDescriptor(toBytes(tableName));
            String colFamilyName = conf.get(ETLJobConstants.ETL_JOB_TABLE_COLUMN_FAMILY_NAME);

            byte[] tablename = htd.getName();
            /* call function to ensure table with 'tablename' exists */

            /*
             * loop and put the file data into the table
             */
            aTable = new HTable(conf, tableName);

            DataRow row = /* logic to generate data */
            while (row != null) {
                byte[] rowKey = toBytes(row.getRowKey());
                Put put = new Put(rowKey);
                for (DataNode node : row.getRowData()) {
                    put.add(toBytes(colFamilyName), toBytes(node.getNodeName()),
                            toBytes(node.getNodeValue()));
                }
                aTable.put(put);
                arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo added another data row to hbase");
                row = fileParser.getNextRow();
            }
            aTable.flushCommits();
            arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxo Finished adding data to hbase");

        } finally {
            if (aTable != null) {
                aTable.close();
            }

            if (admin != null) {
                admin.close();
            }
        }

        arg2.collect(new LongWritable(10), new Text("something"));
        arg3.setStatus("xoxoxoxoxoxoxoxoxoxoxoxoadded some dummy data to the collector");
    }


As you can see at the end, I write some dummy data (10, 'something') to the collector. After the job terminates, I see hundreds of lines of this data in the _success file.

I cannot figure out why the mapper code is restarted multiple times instead of running only once. Any help would be greatly appreciated.

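Calling JobConf.setNumMapTasks(1) only tells Hadoop that you would like to use 1 mapper if possible. It is a hint, unlike setNumReduceTasks, which actually sets the number you specify. The actual number of map tasks is determined by the input splits that the InputFormat computes.

That is why more mappers are running, and why you observe all those numbers.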
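To illustrate why the requested map count is only a hint, here is a minimal sketch in plain Java that loosely models how the old-API FileInputFormat sizes splits for a single non-empty, splittable file. It is a simplified model under stated assumptions, not Hadoop's actual code: the class name SplitCountSketch, the method names, and the 64 MiB block size are illustrative, and it ignores multiple input files, compression, and empty files. Note also that, within each split, the framework calls map() once per input record, so a map() body that emits one dummy line will emit one line per record even with a single map task.

```java
// A simplified, hypothetical model of split counting; not Hadoop source code.
final class SplitCountSketch {

    // A trailing chunk up to 10% larger than splitSize is kept as one split.
    private static final double SPLIT_SLOP = 1.1;

    // Split size is capped by the block size and floored by the minimum size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Returns the number of splits (and hence map tasks) for one file.
    // Assumes fileSize > 0, minSize >= 1 and blockSize >= 1.
    static int countSplits(long fileSize, int requestedMapTasks,
                           long minSize, long blockSize) {
        if (fileSize <= 0 || minSize < 1 || blockSize < 1) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // The requested map count only influences the goal size.
        long goalSize = fileSize / Math.max(1, requestedMapTasks);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final (possibly short) split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long blockSize = 64 * mib; // assumed block size for illustration

        // 1 GiB file, one map task requested: the block size wins.
        System.out.println(countSplits(1024 * mib, 1, 1L, blockSize)); // prints 16

        // 10 MiB file, one map task requested: fits in a single split.
        System.out.println(countSplits(10 * mib, 1, 1L, blockSize)); // prints 1

        // 1 GiB file with the minimum split size raised to the file size.
        System.out.println(countSplits(1024 * mib, 1, 1024 * mib, blockSize)); // prints 1
    }
}
```

In this model, asking for one map task on a large file still yields many splits, because the block size caps the split size. Raising the minimum split size is one way to end up with a single split per file. Whether that is appropriate for a given cluster and input is a separate question.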

For more details, please read