Hadoop/MapReduce: reading and writing classes generated from DDL


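Can someone walk me through the basic workflow for reading and writing data with classes generated from DDL?

I have defined some struct-like records using DDL. For example: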

  class Customer {
     ustring FirstName;
     ustring LastName;
     ustring CardNo;
     long LastPurchase;
  }

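I compiled this to get a Customer class and included it in my project. I can easily see how to use it as input and output for mappers and reducers (the generated class implements Writable), but not how to read and write it to a file.

The JavaDoc for the org.apache.hadoop.record package talks about serializing these records in binary, CSV, or XML format. How do I actually do that? Say my reducer produces IntWritable keys and Customer values. What OutputFormat do I use to write the result in CSV format? And what InputFormat would I use to read the resulting files later, if I wanted to run analysis over them?

OK, I think I have this figured out. I'm not sure it is the most straightforward way, so please correct me if you know a simpler workflow.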

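Every class generated from DDL extends the abstract Record class and consequently provides two methods:

  serialize(RecordOutput out) for writing
  deserialize(RecordInput in) for reading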

RecordOutput and RecordInput are utility interfaces provided in the org.apache.hadoop.record package. There are several implementations of each (e.g. XmlRecordOutput, BinaryRecordOutput, CsvRecordOutput).
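To make the serialize/deserialize contract concrete, here is a minimal sketch that runs without Hadoop. Everything in it is a hypothetical stand-in: `FieldOutput` and `FieldInput` only imitate the shape of RecordOutput and RecordInput, `CustomerSketch` imitates the generated Customer class, and the naive comma-separated encoding is an assumption that does not match the real CSV format (which, for instance, quotes and escapes strings).

```java
// Hypothetical stand-ins for Hadoop's RecordOutput and RecordInput interfaces.
interface FieldOutput {
    void writeString(String s, String tag);
    void writeLong(long l, String tag);
}

interface FieldInput {
    String readString(String tag);
    long readLong(String tag);
}

// Naive comma-separated writer: no quoting or escaping, unlike a real CSV encoder.
class CommaOutput implements FieldOutput {
    private final StringBuilder sb = new StringBuilder();
    private boolean first = true;

    private void separator() {
        if (!first) {
            sb.append(',');
        }
        first = false;
    }

    public void writeString(String s, String tag) { separator(); sb.append(s); }
    public void writeLong(long l, String tag) { separator(); sb.append(l); }
    String line() { return sb.toString(); }
}

// Matching reader: hands the fields back in the order they were written.
class CommaInput implements FieldInput {
    private final String[] fields;
    private int next = 0;

    CommaInput(String line) { this.fields = line.split(",", -1); }

    public String readString(String tag) { return fields[next++]; }
    public long readLong(String tag) { return Long.parseLong(fields[next++]); }
}

// Hypothetical equivalent of the generated Customer class.
class CustomerSketch {
    String firstName = "";
    String lastName = "";
    String cardNo = "";
    long lastPurchase;

    // The record decides the field order; the output decides the encoding.
    void serialize(FieldOutput out) {
        out.writeString(firstName, "FirstName");
        out.writeString(lastName, "LastName");
        out.writeString(cardNo, "CardNo");
        out.writeLong(lastPurchase, "LastPurchase");
    }

    // Fields must be read in exactly the order they were written.
    void deserialize(FieldInput in) {
        firstName = in.readString("FirstName");
        lastName = in.readString("LastName");
        cardNo = in.readString("CardNo");
        lastPurchase = in.readLong("LastPurchase");
    }
}

class RecordContractSketch {
    static String toLine(CustomerSketch c) {
        CommaOutput out = new CommaOutput();
        c.serialize(out);
        return out.line();
    }

    static CustomerSketch fromLine(String line) {
        CustomerSketch c = new CustomerSketch();
        c.deserialize(new CommaInput(line));
        return c;
    }

    public static void main(String[] args) {
        CustomerSketch c = new CustomerSketch();
        c.firstName = "Ada";
        c.lastName = "Lovelace";
        c.cardNo = "1234";
        c.lastPurchase = 42L;
        String line = toLine(c);
        System.out.println(line);
        System.out.println(fromLine(line).lastName);
    }
}
```

The point of the split is that the record class never knows which format it is being written in: swapping the output implementation changes the encoding without touching the record.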

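As far as I know, you have to implement your own OutputFormat or InputFormat class to use these. This is fairly easy to do.

For example, the OutputFormat I talked about in the original question (one that writes IntWritable keys and Customer values in CSV format) would be implemented like this: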


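  import java.io.DataOutputStream;
  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordWriter;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.TextOutputFormat;
  import org.apache.hadoop.record.CsvRecordOutput;
  import org.apache.hadoop.util.Progressable;

  // Nested in the job's driver class. Writes one CSV line per (key, value) pair.
  private static class CustomerOutputFormat
    extends TextOutputFormat<IntWritable, Customer>
  {

    @Override
    public RecordWriter<IntWritable, Customer> getRecordWriter(FileSystem ignored,
      JobConf job,
      String name,
      Progressable progress)
    throws IOException {
      // Create this task's output file and hand its stream to the writer.
      Path file = FileOutputFormat.getTaskOutputPath(job, name);
      FileSystem fs = file.getFileSystem(job);
      FSDataOutputStream fileOut = fs.create(file, progress);
      return new CustomerRecordWriter(fileOut);
    }

    protected static class CustomerRecordWriter
      implements RecordWriter<IntWritable, Customer>
    {

      protected DataOutputStream outStream;

      // The constructor must carry the class name (the original said AnchorRecordWriter).
      public CustomerRecordWriter(DataOutputStream out) {
        this.outStream = out;
      }

      public synchronized void write(IntWritable key, Customer value) throws IOException {
        CsvRecordOutput csvOutput = new CsvRecordOutput(outStream);
        // RecordOutput declares writeInt, not writeInteger.
        csvOutput.writeInt(key.get(), "id");
        // The key goes first, followed by the record's own fields on the same line.
        value.serialize(csvOutput);
      }

      public synchronized void close(Reporter reporter) throws IOException {
        outStream.close();
      }
    }
  }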


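Creating the InputFormat is much the same as creating the OutputFormat. Because the CSV format is one entry per line, we can use a LineRecordReader internally to do most of the work.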



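  import java.io.ByteArrayInputStream;
  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.LineRecordReader;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.record.CsvRecordInput;

  // Nested in the job's driver class. Reads back the lines written by CustomerOutputFormat.
  private static class CustomerInputFormat extends FileInputFormat<IntWritable, Customer> {

    @Override
    public RecordReader<IntWritable, Customer> getRecordReader(
      InputSplit genericSplit,
      JobConf job,
      Reporter reporter)
    throws IOException {

      reporter.setStatus(genericSplit.toString());
      return new CustomerRecordReader(job, (FileSplit) genericSplit);
    }

    // Static: the reader needs no reference to the enclosing InputFormat.
    private static class CustomerRecordReader implements RecordReader<IntWritable, Customer> {

      // LineRecordReader handles splits and line boundaries; we only parse each line.
      private final LineRecordReader lrr;
      private final LongWritable offset = new LongWritable();
      private final Text line = new Text();

      public CustomerRecordReader(Configuration job, FileSplit split)
      throws IOException {
        this.lrr = new LineRecordReader(job, split);
      }

      public IntWritable createKey() {
        return new IntWritable();
      }

      public Customer createValue() {
        return new Customer();
      }

      public synchronized boolean next(IntWritable key, Customer value)
      throws IOException {

        if (!lrr.next(offset, line))
          return false;

        // LineRecordReader strips the line terminator, but CsvRecordInput looks for
        // one at the end of a top-level record, so put it back. The bytes are encoded
        // explicitly as UTF-8 instead of relying on the platform default charset.
        byte[] bytes = (line.toString() + "\n").getBytes("UTF-8");
        CsvRecordInput cri = new CsvRecordInput(new ByteArrayInputStream(bytes));
        key.set(cri.readInt("id"));
        value.deserialize(cri);

        return true;
      }

      public float getProgress() throws IOException {
        return lrr.getProgress();
      }

      public synchronized long getPos() throws IOException {
        return lrr.getPos();
      }

      public synchronized void close() throws IOException {
        lrr.close();
      }
    }
  }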
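The framing that both classes rely on (key first, then the record's fields, one record per line) can be tried out without a cluster. The sketch below is a plain-Java illustration under stated assumptions: `KeyedLineSketch` and its methods are hypothetical names, the fields are joined with bare commas rather than the real CSV record encoding, and every line is assumed to hold a key followed by at least one field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyedLineSketch {

    // Mirrors write(key, value): key first, then the record's fields, then a newline.
    static String writeAll(Map<Integer, String[]> rows) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Integer, String[]> e : rows.entrySet()) {
            out.append(e.getKey());
            for (String field : e.getValue()) {
                out.append(',').append(field);
            }
            out.append('\n');
        }
        return out.toString();
    }

    // Mirrors next(key, value): take one line, parse the key, then the remaining fields.
    static Map<Integer, String[]> readAll(String text) {
        Map<Integer, String[]> rows = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue; // nothing was written on this line
            }
            int comma = line.indexOf(',');
            int key = Integer.parseInt(line.substring(0, comma));
            String[] fields = line.substring(comma + 1).split(",", -1);
            rows.put(key, fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Integer, String[]> rows = new LinkedHashMap<>();
        rows.put(1, new String[] {"Ada", "Lovelace", "1234", "42"});
        rows.put(2, new String[] {"Alan", "Turing", "5678", "7"});

        String text = writeAll(rows);
        System.out.print(text);
        System.out.println(readAll(text).size() + " records read back");
    }
}
```

Because each record is a self-contained line, a file written this way can be split at line boundaries, which is what lets the real reader delegate to LineRecordReader.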


