Java: How to read in an RCFile


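I am trying to read a small RCFile (about 200 rows of data) into a HashMap to do a map-side join, but I am having a lot of trouble getting the data in the file into a usable state.

Here is what I have so far, most of which comes from:

How do I read this data in properly so that I can access one row at a time, for example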


(191, United States, US, 19)

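After some more digging, I found a solution. The key here is not to use RCFile.Reader but to use RCFileRecordReader instead.

Here is what I ended up with, which also works for opening multiple files: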

// `job` is the JobConf (org.apache.hadoop.mapred.JobConf) already in scope.
try
{
    FileSystem fs = FileSystem.get(job);
    FileStatus[] fileStatuses = fs.listStatus(new Path("/path/to/dir/"));
    // key and value are reused for every row of every file.
    LongWritable key = new LongWritable();
    BytesRefArrayWritable value = new BytesRefArrayWritable();
    int counter = 1;
    for (int i = 0; i < fileStatuses.length; i++)
    {
        FileStatus fileStatus = fileStatuses[i];
        if (!fileStatus.isDir())
        {
            System.out.println("File: " + fileStatus);
            // A single split covering the whole file, so every row is read.
            FileSplit split = new FileSplit(fileStatus.getPath(), 0, fileStatus.getLen(), job);
            RCFileRecordReader reader = new RCFileRecordReader(job, split);
            try
            {
                while (reader.next(key, value))
                {
                    System.out.println("Getting row " + counter);
                    AllCountriesRow acr = AllCountriesRow.valueOf(value);
                    System.out.println("ROW: " + acr);
                    counter++;
                }
            }
            finally
            {
                // Release the underlying file stream even if a row fails to parse.
                reader.close();
            }
        }
    }
}
catch (IOException e)
{
    throw new RuntimeException(e);
}

This ends up giving you an AllCountriesRow object that contains all the information for the row in question.
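Since the goal in the question is a HashMap for a map-side join, here is a minimal sketch of the lookup table those rows could be loaded into. CountryRow and CountryLookup are hypothetical names used only for illustration (CountryRow stands in for AllCountriesRow), keying by id is an assumption about the join, and the sample values in main are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for AllCountriesRow; fields mirror the valueOf method in this thread. */
final class CountryRow {
    final int id;
    final String name;
    final String code;
    final String regionName;
    final int continentId;

    CountryRow(int id, String name, String code, String regionName, int continentId) {
        this.id = id;
        this.name = name;
        this.code = code;
        this.regionName = regionName;
        this.continentId = continentId;
    }
}

/** In-memory lookup table for the small side of a map-side join (assumed to be keyed by id). */
final class CountryLookup {
    private final Map<Integer, CountryRow> byId = new HashMap<>();

    /** Call once per row, for example inside the while (reader.next(key, value)) loop. */
    void add(CountryRow row) {
        byId.put(row.id, row);
    }

    /** Returns the country name for an id, or null when the id is not in the table. */
    String nameFor(int id) {
        CountryRow row = byId.get(id);
        return row == null ? null : row.name;
    }

    int size() {
        return byId.size();
    }

    public static void main(String[] args) {
        CountryLookup lookup = new CountryLookup();
        // Made-up sample row, only to show the call pattern.
        lookup.add(new CountryRow(191, "United States", "US", "North America", 19));
        System.out.println(lookup.nameFor(191));
    }
}
```

In a real job the table would typically be filled once, before any records are mapped, and then only read from inside map().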

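Because of the columnar nature of RCFile, the row-wise read path differs significantly from the write path. We can still use the RCFile.Reader class to read an RCFile row by row (RCFileRecordReader is not required). In addition, though, we need to use ColumnarSerDe to convert the columnar data into row-wise data.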
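To make the column-to-row conversion concrete, here is a plain-Java sketch that uses no Hive classes. ColumnsToRows and toRows are hypothetical names; this only illustrates the idea of turning one list per column into one list per row, and is not how ColumnarSerDe is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnsToRows {

    /** Turns column-major data (one list per column) into row-major data (one list per row). */
    static List<List<String>> toRows(List<List<String>> columns) {
        List<List<String>> rows = new ArrayList<>();
        if (columns.isEmpty()) {
            return rows;
        }
        int rowCount = columns.get(0).size();
        for (List<String> column : columns) {
            if (column.size() != rowCount) {
                throw new IllegalArgumentException("columns differ in length");
            }
        }
        for (int r = 0; r < rowCount; r++) {
            List<String> row = new ArrayList<>();
            for (List<String> column : columns) {
                row.add(column.get(r)); // pick the r-th value of every column
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two made-up columns holding two rows of data.
        List<List<String>> columns = Arrays.asList(
            Arrays.asList("1", "2"),
            Arrays.asList("a", "b"));
        for (List<String> row : toRows(columns)) {
            System.out.println(row);
        }
    }
}
```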

Below is about the simplest code we can get for reading an RCFile row by row. See the inline code comments for more details.

private static void readRCFileByRow(String pathStr)
  throws IOException, SerDeException {

  final Configuration conf = new Configuration();

  final Properties tbl = new Properties();

  /*
   * Set the column names (comma-separated) and the column types
   * (colon-separated here). The actual names of the columns are not
   * important, as long as the column count is correct.
   *
   * For types, this example uses strings. A byte[] can be stored as a string
   * by encoding the bytes as ASCII (for example as a hex string or Base64).
   *
   * The number of columns and the number of types must match exactly.
   */
  tbl.setProperty("columns", "col1,col2,col3,col4,col5");
  tbl.setProperty("columns.types", "string:string:string:string:string");

  /*
   * We need a ColumnarSerDe to de-serialize the columnar data to row-wise 
   * data 
   */
  ColumnarSerDe serDe = new ColumnarSerDe();
  serDe.initialize(conf, tbl);

  Path path = new Path(pathStr);
  FileSystem fs = FileSystem.get(conf);
  final RCFile.Reader reader = new RCFile.Reader(fs, path, conf);

  final LongWritable key = new LongWritable();
  final BytesRefArrayWritable cols = new BytesRefArrayWritable();

  while (reader.next(key)) {
    System.out.println("Getting next row.");

    /*
     * IMPORTANT: Pass the same cols object to getCurrentRow on every
     * iteration; do not create a new BytesRefArrayWritable each time. One call
     * to getCurrentRow(cols) can potentially read values for more than one
     * column, which the SerDe below then reads one by one.
     */
    reader.getCurrentRow(cols);

    final ColumnarStruct row = (ColumnarStruct) serDe.deserialize(cols);
    final ArrayList<Object> objects = row.getFieldsAsList();
    for (final Object object : objects) {
      // Lazy decompression happens here
      final String payload = 
        ((LazyString) object).getWritableObject().toString();
      System.out.println("Value:" + payload);
    }
  }

  // Release the underlying file stream once all rows have been read.
  reader.close();
}
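Because the comment in the code above says the number of columns and the number of types must match exactly, a small pre-check on the table properties can catch a mismatch before the SerDe is initialized. This is only a sketch: TableSchemaCheck and countsMatch are hypothetical names, and it assumes primitive type names only (complex types such as map or struct contain separators of their own and would be miscounted).

```java
import java.util.Properties;

final class TableSchemaCheck {

    /**
     * Returns true when "columns" (comma-separated) and "columns.types"
     * (colon-separated, primitive types only) list the same number of entries.
     */
    static boolean countsMatch(Properties tbl) {
        String names = tbl.getProperty("columns", "");
        String types = tbl.getProperty("columns.types", "");
        return count(names, ",") == count(types, ":");
    }

    private static int count(String value, String separator) {
        if (value.trim().isEmpty()) {
            return 0;
        }
        // limit -1 keeps trailing empty entries, so "a,b," counts as three
        return value.split(separator, -1).length;
    }

    public static void main(String[] args) {
        Properties tbl = new Properties();
        tbl.setProperty("columns", "col1,col2,col3,col4,col5");
        tbl.setProperty("columns.types", "string:string:string:string:string");
        System.out.println(countsMatch(tbl));
    }
}
```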
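In this code, getCurrentRow still reads the data column by column, and we need the SerDe to convert it into rows. Also, calling getCurrentRow() does not mean that all the fields in the row have been decompressed. In fact, with lazy decompression, a column is not decompressed until one of its fields is deserialized. For this we used columnarStruct.getFieldsAsList() to get a list of references to the lazy objects. The actual read happens in the getWritableObject() call on the LazyString reference.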
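The lazy behaviour described above can be illustrated with a small plain-Java sketch. LazyColumn is a hypothetical class, not a Hive type, and the Supplier stands in for whatever actually decompresses a column; the point is only that the expensive step runs on first field access and not before.

```java
import java.util.function.Supplier;

/** Hypothetical illustration of lazy decompression; not Hive's implementation. */
final class LazyColumn {
    private final Supplier<String[]> decompressor; // stands in for the real decompression step
    private String[] values;                       // stays null until the first field is read
    private int decompressCount;

    LazyColumn(Supplier<String[]> decompressor) {
        this.decompressor = decompressor;
    }

    /** Decompresses the whole column on first access, then serves fields from memory. */
    String get(int row) {
        if (values == null) {
            values = decompressor.get();
            decompressCount++;
        }
        return values[row];
    }

    /** How many times the decompression step has actually run. */
    int decompressCount() {
        return decompressCount;
    }

    public static void main(String[] args) {
        LazyColumn column = new LazyColumn(() -> new String[] {"a", "b"});
        System.out.println("before any read: " + column.decompressCount());
        System.out.println("field: " + column.get(1));
        System.out.println("after one read: " + column.decompressCount());
    }
}
```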

Another way to achieve the same thing is to use a StructObjectInspector together with the copyToStandardObject API. But I found the approach above simpler.

The AllCountriesRow.valueOf method called in the RCFileRecordReader loop looks like this (serDe and the Column enum are defined elsewhere in the class and are not shown):

public static AllCountriesRow valueOf(BytesRefArrayWritable braw) throws IOException
{
    try
    {
        // The inspector describes the row layout; deserialize() turns the columnar bytes into a row object.
        StructObjectInspector soi = (StructObjectInspector) serDe.getObjectInspector();
        Object row = serDe.deserialize(braw);
        List<? extends StructField> fieldRefs = soi.getAllStructFieldRefs();

        // For each column: fetch the raw field data, then read it through that field's own inspector.
        Object fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.ID.ordinal()));
        ObjectInspector oi = fieldRefs.get(Column.ID.ordinal()).getFieldObjectInspector();
        int id = ((IntObjectInspector)oi).get(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.NAME.ordinal()));
        oi = fieldRefs.get(Column.NAME.ordinal()).getFieldObjectInspector();
        String name = ((StringObjectInspector)oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CODE.ordinal()));
        oi = fieldRefs.get(Column.CODE.ordinal()).getFieldObjectInspector();
        String code = ((StringObjectInspector)oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.REGION_NAME.ordinal()));
        oi = fieldRefs.get(Column.REGION_NAME.ordinal()).getFieldObjectInspector();
        String regionName = ((StringObjectInspector)oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CONTINENT_ID.ordinal()));
        oi = fieldRefs.get(Column.CONTINENT_ID.ordinal()).getFieldObjectInspector();
        int continentId = ((IntObjectInspector)oi).get(fieldData);

        return new AllCountriesRow(id, name, code, regionName, continentId);
    }
    catch (SerDeException e)
    {
        // Surface SerDe failures as IOException so callers only handle one exception type.
        throw new IOException(e);
    }
}