Hadoop - what should the key and the value be?


I am new to Hadoop.

My goal is to upload a large number of files with different extensions onto a Hadoop cluster and get output like this:

Extension    Count
.jpeg         1000
.java          600
.txt          3000

and so on.

I assumed the file name has to be the key passed to the mapper method, so that I can read the extension (and perform other file operations in the future).

Custom FileInputFormat

/**
 * 
 */
package com.hadoop.mapred.scratchpad;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomFileInputFormat extends
        FileInputFormat<String, NullWritable> {

    @Override
    public RecordReader<String, NullWritable> getRecordReader(InputSplit aFile,
            JobConf arg1, Reporter arg2) throws IOException {
        // TODO Auto-generated method stub

        System.out.println("In CustomFileInputFormat.getRecordReader(...)");
        /* the cast - ouch ! */
        CustomRecordReader custRecRdr = new CustomRecordReader(
                (FileSplit) aFile);

        return custRecRdr;
    }

}
Logs (I opened only one; the full dump appears at the end of this post).

As can be seen:

  • The output on HDFS is a 0 KB file
  • The logs show the sysouts only up to the point where the thread enters the CustomRecordReader
  • What am I missing?

    Kaliyug

    As per your need, there is no need to pass the file name to the mapper; it is already available inside the mapper. Just access it as below. The rest is fairly straightforward: just mimic the simple word-count program.

      FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
      String fileName = fileSplit.getPath().getName();
    
    If you are on the new API, Reporter needs to be changed to Context.
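
    With the new (mapreduce) API, the same lookup reads the split from the Context instead (a two-line sketch; note that FileSplit is then org.apache.hadoop.mapreduce.lib.input.FileSplit):

      FileSplit fileSplit = (FileSplit) context.getInputSplit();
      String fileName = fileSplit.getPath().getName();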

    For performance optimization, you can simply create a record reader that supplies the file name as the key to the mapper (the same lookup as above). Make the record reader not read any file contents, and make the value part a NullWritable; a sketch of such a reader follows.
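
    A minimal sketch, assuming the old mapred API used in the posted code: it serves exactly one record per split (the file name) and never opens the file. Text stands in for String as the key type, because the framework creates the key once via createKey() and then fills it in inside next(), which an immutable String cannot support; the class name FileNameRecordReader is made up for illustration.

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.RecordReader;

    public class FileNameRecordReader implements RecordReader<Text, NullWritable> {

        private final FileSplit split;
        private boolean processed = false; // set once the single record has been served

        public FileNameRecordReader(FileSplit split) {
            this.split = split;
        }

        @Override
        public Text createKey() {
            return new Text(); // an empty, reusable key for the framework to fill
        }

        @Override
        public NullWritable createValue() {
            return NullWritable.get(); // never null: the framework hands it back to next()
        }

        @Override
        public boolean next(Text key, NullWritable value) throws IOException {
            if (processed) {
                return false; // no further records in this split
            }
            key.set(split.getPath().getName()); // the file name is the one and only record
            processed = true;
            return true;
        }

        @Override
        public long getPos() throws IOException {
            return processed ? 1 : 0;
        }

        @Override
        public float getProgress() throws IOException {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() throws IOException {
            // nothing to release: the file is never opened
        }
    }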

    The mapper then gets the file name as its key; just emit <extension, 1> pairs for the reducer.
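
    Paired with that reader, the mapper's input types become Text and NullWritable, and the map body shrinks to a few lines. A sketch (ExtensionMapper is a made-up name; the plain-Java extension parsing stands in for the FilenameUtils.getExtension call used in the code below):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ExtensionMapper extends MapReduceBase
            implements Mapper<Text, NullWritable, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Text fileName, NullWritable value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String name = fileName.toString();
            int dot = name.lastIndexOf('.');
            // Emit <extension, 1>, word-count style; the dot is kept so the
            // output matches the ".jpeg 1000" format asked for above.
            String extension = (dot >= 0) ? name.substring(dot) : "";
            output.collect(new Text(extension), ONE);
        }
    }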


    The reducer needs to perform the same logic as word count; the CustomReducer posted below does exactly this.

    • Hi Arun, thanks a lot for the pointers! I have edited my original question; the code I wrote follows the "optimized approach" mentioned in your comment. I am not clear about the Reporter usage you suggested. Please evaluate the code I wrote (it follows, together with the logs).
    • The Reporter class comes with the old Hadoop API. With the new API, instead of Reporter you simply use a Context object; that is what I meant.
    • OK. Can you tell me what mistake I made in my code? The sysouts in the Mapper and Reducer never show up, and the job finishes without errors or exceptions!
    • Check the sysouts on the web UI; they do not appear on the system console. Just click on a map task or reduce task attempt id to see its sysout and logs.
    
    /**
     * 
     */
    package com.hadoop.mapred.scratchpad;
    
    import java.io.IOException;
    
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.RecordReader;
    
    public class CustomRecordReader implements RecordReader<String, NullWritable> {
    
        private FileSplit aFile;
        private String fileName;
    
        public CustomRecordReader(FileSplit aFile) {
    
            this.aFile = aFile;
    
            System.out.println("In CustomRecordReader constructor aFile is "
                    + aFile.getClass().getName());
        }
    
        @Override
        public void close() throws IOException {
            // TODO Auto-generated method stub
    
        }
    
        @Override
        public String createKey() {
            // createKey() is normally expected to return an empty, reusable key
            // object that next(key, value) then fills in; here it returns the
            // file name directly.
            fileName = aFile.getPath().getName();

            System.out.println("In CustomRecordReader.createKey() " + fileName);

            return fileName;
        }
    
        @Override
        public NullWritable createValue() {
            // Returns null rather than NullWritable.get().
            return null;
        }
    
        @Override
        public long getPos() throws IOException {
            // TODO Auto-generated method stub
            return 0;
        }
    
        @Override
        public float getProgress() throws IOException {
            // TODO Auto-generated method stub
            return 0;
        }
    
        @Override
        public boolean next(String arg0, NullWritable arg1) throws IOException {
            // Always returns false, so the framework never receives a record and
            // map() is never invoked: hence the empty output and the missing
            // mapper/reducer sysouts.
            return false;
        }
    
    }
    
    package com.hadoop.mapred.scratchpad;
    
    import java.io.IOException;
    
    import org.apache.commons.io.FilenameUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    
    public class CustomMapperClass extends MapReduceBase implements
            Mapper<String, NullWritable, Text, IntWritable> {
    
        private static final int COUNT = 1;
    
        @Override
        public void map(String fileName, NullWritable value,
                OutputCollector<Text, IntWritable> outputCollector,
                Reporter reporter) throws IOException {
            // TODO Auto-generated method stub
            System.out.println("In CustomMapperClass.map(...) : key " + fileName
                    + " value = " + value);
    
            outputCollector.collect(new Text(FilenameUtils.getExtension(fileName)),
                    new IntWritable(COUNT));
    
            System.out.println("Returning from CustomMapperClass.map(...)");
        }
    
    }
    
    /**
     * 
     */
    package com.hadoop.mapred.scratchpad;
    
    import java.io.IOException;
    import java.util.Iterator;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    
    
    public class CustomReducer extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
    
        @Override
        public void reduce(Text fileExtn, Iterator<IntWritable> countCollection,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // TODO Auto-generated method stub
    
            System.out.println("In CustomReducer.reduce(...)");
            int count = 0;
    
            while (countCollection.hasNext()) {
                count += countCollection.next().get();
            }
    
            output.collect(fileExtn, new IntWritable(count));
    
            System.out.println("Returning CustomReducer.reduce(...)");
        }
    
    }
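
    The question never shows the job driver. For reference, a minimal driver wiring these classes together might look like the sketch below (ExtensionCountJob and the argument handling are assumptions, not part of the original post):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ExtensionCountJob {

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(ExtensionCountJob.class);
            conf.setJobName("extension-count");

            // Plug in the custom input format so each split yields a file name.
            conf.setInputFormat(CustomFileInputFormat.class);
            conf.setMapperClass(CustomMapperClass.class);
            conf.setReducerClass(CustomReducer.class);

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }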
    
    hd@cloudx-538-520:~/hadoop/logs/userlogs$ hadoop fs -ls /scratchpad/output
    Warning: $HADOOP_HOME is deprecated.
    
    Found 3 items
    -rw-r--r--   4 hd supergroup          0 2012-10-11 20:52 /scratchpad/output/_SUCCESS
    drwxr-xr-x   - hd supergroup          0 2012-10-11 20:51 /scratchpad/output/_logs
    -rw-r--r--   4 hd supergroup          0 2012-10-11 20:52 /scratchpad/output/part-00000
    hd@cloudx-538-520:~/hadoop/logs/userlogs$
    hd@cloudx-538-520:~/hadoop/logs/userlogs$ hadoop fs -ls /scratchpad/output/_logs
    Warning: $HADOOP_HOME is deprecated.
    
    Found 1 items
    drwxr-xr-x   - hd supergroup          0 2012-10-11 20:51 /scratchpad/output/_logs/history
    hd@cloudx-538-520:~/hadoop/logs/userlogs$
    hd@cloudx-538-520:~/hadoop/logs/userlogs$
    
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$ ls -lrt
    total 16
    -rw-r----- 1 hd hd 393 2012-10-11 20:52 job-acls.xml
    lrwxrwxrwx 1 hd hd  95 2012-10-11 20:52 attempt_201210091538_0019_m_000000_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000000_0
    lrwxrwxrwx 1 hd hd  95 2012-10-11 20:52 attempt_201210091538_0019_m_000002_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000002_0
    lrwxrwxrwx 1 hd hd  95 2012-10-11 20:52 attempt_201210091538_0019_m_000001_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000001_0
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$ cat attempt_201210091538_0019_m_000000_0/stdout
    In CustomFileInputFormat.getRecordReader(...)
    In CustomRecordReader constructor aFile is org.apache.hadoop.mapred.FileSplit
    In CustomRecordReader.createKey() ExtJS_Notes.docx
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
    