How to read an RCFile in Java
I am trying to read a small RCFile (about 200 rows of data) into a HashMap to do a map-side join, but I am having a lot of trouble getting the data in the file into a usable state. Here is what I have so far, most of which was lifted from:

How do I read this data in properly so that I can access one row at a time, e.g.
(191, USA, USA, 19)?

After some further digging, I found a solution. The key here is not to use RCFile.Reader but rather RCFileRecordReader.

Here is what I ended up with, which also works for opening multiple files:
try
{
    FileSystem fs = FileSystem.get(job);
    FileStatus[] fileStatuses = fs.listStatus(new Path("/path/to/dir/"));
    LongWritable key = new LongWritable();
    BytesRefArrayWritable value = new BytesRefArrayWritable();
    int counter = 1;
    for (int i = 0; i < fileStatuses.length; i++)
    {
        FileStatus fileStatus = fileStatuses[i];
        if (!fileStatus.isDir())
        {
            System.out.println("File: " + fileStatus);
            FileSplit split = new FileSplit(fileStatus.getPath(), 0, fileStatus.getLen(), job);
            RCFileRecordReader reader = new RCFileRecordReader(job, split);
            while (reader.next(key, value))
            {
                System.out.println("Getting row " + counter);
                AllCountriesRow acr = AllCountriesRow.valueOf(value);
                System.out.println("ROW: " + acr);
                counter++;
            }
        }
    }
}
catch (IOException e)
{
    throw new Error(e);
}
This ends up with an AllCountriesRow object that contains all the information for the row in question.

Because of the columnar nature of RCFile, the row-wise read path is quite different from the write path. We can still use the RCFile.Reader class to read an RCFile row by row (RCFileRecordReader is not needed). But in addition, we need a ColumnarSerDe to convert the columnar data into row-wise data.

Below is the simplest code we can have for reading an RCFile row by row. See the inline code comments for more details.
private static void readRCFileByRow(String pathStr)
        throws IOException, SerDeException {
    final Configuration conf = new Configuration();
    final Properties tbl = new Properties();
    /*
     * Set the column names and types using comma-separated strings.
     * The actual names of the columns are not important, as long as the
     * column count is correct.
     *
     * For types, this example uses strings. byte[] can be stored as a string
     * by encoding the bytes as ASCII (such as hex or Base64).
     *
     * The number of columns and the number of types must match exactly.
     */
    tbl.setProperty("columns", "col1,col2,col3,col4,col5");
    tbl.setProperty("columns.types", "string:string:string:string:string");
    /*
     * We need a ColumnarSerDe to deserialize the columnar data into
     * row-wise data.
     */
    ColumnarSerDe serDe = new ColumnarSerDe();
    serDe.initialize(conf, tbl);
    Path path = new Path(pathStr);
    FileSystem fs = FileSystem.get(conf);
    final RCFile.Reader reader = new RCFile.Reader(fs, path, conf);
    final LongWritable key = new LongWritable();
    final BytesRefArrayWritable cols = new BytesRefArrayWritable();
    while (reader.next(key)) {
        System.out.println("Getting next row.");
        /*
         * IMPORTANT: Pass the same cols object to the getCurrentRow API; do
         * not create a new BytesRefArrayWritable() each time. One call to
         * getCurrentRow(cols) can potentially read more than one column's
         * values, which the SerDe below takes care of reading one by one.
         */
        reader.getCurrentRow(cols);
        final ColumnarStruct row = (ColumnarStruct) serDe.deserialize(cols);
        final ArrayList<Object> objects = row.getFieldsAsList();
        for (final Object object : objects) {
            // Lazy decompression happens here
            final String payload =
                ((LazyString) object).getWritableObject().toString();
            System.out.println("Value: " + payload);
        }
    }
}
In this code, getCurrentRow still reads the data column-wise, and we need the SerDe to convert it into a row. Also, calling getCurrentRow() does not mean that all the fields in the row have been decompressed. In fact, under lazy decompression, a column is not decompressed until one of its fields is deserialized. To that end, we use ColumnarStruct.getFieldsAsList() to get a list of references to the lazy objects. The actual read happens in the getWritableObject() call on the LazyString reference.

Another way to achieve the same thing is to use a StructObjectInspector with the copyToStandardObject API, but I find the method above simpler.
For completeness, here is the AllCountriesRow.valueOf method used in the first snippet above, which deserializes a BytesRefArrayWritable into an AllCountriesRow:
public static AllCountriesRow valueOf(BytesRefArrayWritable braw) throws IOException
{
    try
    {
        StructObjectInspector soi = (StructObjectInspector) serDe.getObjectInspector();
        Object row = serDe.deserialize(braw);
        List<? extends StructField> fieldRefs = soi.getAllStructFieldRefs();

        Object fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.ID.ordinal()));
        ObjectInspector oi = fieldRefs.get(Column.ID.ordinal()).getFieldObjectInspector();
        int id = ((IntObjectInspector) oi).get(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.NAME.ordinal()));
        oi = fieldRefs.get(Column.NAME.ordinal()).getFieldObjectInspector();
        String name = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CODE.ordinal()));
        oi = fieldRefs.get(Column.CODE.ordinal()).getFieldObjectInspector();
        String code = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.REGION_NAME.ordinal()));
        oi = fieldRefs.get(Column.REGION_NAME.ordinal()).getFieldObjectInspector();
        String regionName = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CONTINENT_ID.ordinal()));
        oi = fieldRefs.get(Column.CONTINENT_ID.ordinal()).getFieldObjectInspector();
        int continentId = ((IntObjectInspector) oi).get(fieldData);

        return new AllCountriesRow(id, name, code, regionName, continentId);
    }
    catch (SerDeException e)
    {
        throw new IOException(e);
    }
}
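The original goal was to cache these rows in a HashMap for a map-side join. A minimal sketch of that cache, keyed by id; the inner AllCountriesRow class and its getId() accessor here are hypothetical stand-ins for the real class deserialized above (which needs the Hadoop/Hive classpath):

```java
import java.util.HashMap;
import java.util.Map;

public class MapSideJoinCache {
    // Simplified stand-in for the AllCountriesRow built by valueOf() above
    // (hypothetical; the real class is deserialized from the RCFile).
    static class AllCountriesRow {
        final int id;
        final String name, code, regionName;
        final int continentId;

        AllCountriesRow(int id, String name, String code,
                        String regionName, int continentId) {
            this.id = id;
            this.name = name;
            this.code = code;
            this.regionName = regionName;
            this.continentId = continentId;
        }

        int getId() { return id; }
    }

    public static void main(String[] args) {
        // In the real job this map would be filled inside the
        // while (reader.next(key, value)) loop shown earlier.
        Map<Integer, AllCountriesRow> countriesById = new HashMap<>();
        AllCountriesRow acr = new AllCountriesRow(191, "USA", "USA", "Americas", 19);
        countriesById.put(acr.getId(), acr);

        // Map-side join: probe the cache with the join key of each record.
        AllCountriesRow match = countriesById.get(191);
        System.out.println(match.name); // prints "USA"
    }
}
```

Keying by id matches the example output above; any other field could serve as the key, depending on which column the join is on.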