How to read an RCFile in Java
I am trying to read a small RCFile (about 200 rows of data) into a HashMap to do a map-side join, but I am having a lot of trouble getting the data in the file into a usable state. Here is what I have so far, most of which was lifted from:

How do I read this data in properly so that I can access one row at a time, e.g.
(191, USA, USA, 19)?

After some further digging, I found a solution. The key here is not to use RCFile.Reader but rather RCFileRecordReader.

Here is what I ended up with, which also works for opening multiple files:
try
{
    FileSystem fs = FileSystem.get(job);
    FileStatus[] fileStatuses = fs.listStatus(new Path("/path/to/dir/"));
    LongWritable key = new LongWritable();
    BytesRefArrayWritable value = new BytesRefArrayWritable();
    int counter = 1;
    for (int i = 0; i < fileStatuses.length; i++)
    {
        FileStatus fileStatus = fileStatuses[i];
        if (!fileStatus.isDir())
        {
            System.out.println("File: " + fileStatus);
            FileSplit split = new FileSplit(fileStatus.getPath(), 0, fileStatus.getLen(), job);
            RCFileRecordReader reader = new RCFileRecordReader(job, split);
            while (reader.next(key, value))
            {
                System.out.println("Getting row " + counter);
                AllCountriesRow acr = AllCountriesRow.valueOf(value);
                System.out.println("ROW: " + acr);
                counter++;
            }
        }
    }
}
catch (IOException e)
{
    throw new Error(e);
}
This ends up with an AllCountriesRow object that contains all the information for the row in question.

Because of the columnar nature of RCFile, the row-wise read path is quite different from the write path. We can still use the RCFile.Reader class to read an RCFile row by row (RCFileRecordReader is not needed). But in addition, we need a ColumnarSerDe to convert the columnar data into row-wise data.

Below is the simplest code we can have for reading an RCFile row by row. See the inline code comments for more details.
private static void readRCFileByRow(String pathStr)
        throws IOException, SerDeException {
    final Configuration conf = new Configuration();
    final Properties tbl = new Properties();
    /*
     * Set the column names and types using comma-separated strings.
     * The actual names of the columns are not important, as long as the
     * column count is correct.
     *
     * For types, this example uses strings. byte[] can be stored as a string
     * by encoding the bytes as ASCII (such as hex or Base64).
     *
     * The number of columns and the number of types must match exactly.
     */
    tbl.setProperty("columns", "col1,col2,col3,col4,col5");
    tbl.setProperty("columns.types", "string:string:string:string:string");
    /*
     * We need a ColumnarSerDe to deserialize the columnar data into
     * row-wise data.
     */
    ColumnarSerDe serDe = new ColumnarSerDe();
    serDe.initialize(conf, tbl);
    Path path = new Path(pathStr);
    FileSystem fs = FileSystem.get(conf);
    final RCFile.Reader reader = new RCFile.Reader(fs, path, conf);
    final LongWritable key = new LongWritable();
    final BytesRefArrayWritable cols = new BytesRefArrayWritable();
    while (reader.next(key)) {
        System.out.println("Getting next row.");
        /*
         * IMPORTANT: Pass the same cols object to the getCurrentRow API; do
         * not create a new BytesRefArrayWritable() each time. One call to
         * getCurrentRow(cols) can potentially read more than one column's
         * values, which the SerDe below takes care of reading one by one.
         */
        reader.getCurrentRow(cols);
        final ColumnarStruct row = (ColumnarStruct) serDe.deserialize(cols);
        final ArrayList<Object> objects = row.getFieldsAsList();
        for (final Object object : objects) {
            // Lazy decompression happens here
            final String payload =
                ((LazyString) object).getWritableObject().toString();
            System.out.println("Value: " + payload);
        }
    }
}
In this code, getCurrentRow still reads the data column-wise, and we need the SerDe to convert it into a row. Also, calling getCurrentRow() does not mean that all the fields in the row have been decompressed. In fact, under lazy decompression, a column is not decompressed until one of its fields is deserialized. To that end, we use ColumnarStruct.getFieldsAsList() to get a list of references to the lazy objects. The actual read happens in the getWritableObject() call on the LazyString reference.

Another way to achieve the same thing is to use a StructObjectInspector with the copyToStandardObject API, but I find the method above simpler.
For completeness, here is the AllCountriesRow.valueOf method used in the first snippet above, which deserializes a BytesRefArrayWritable into an AllCountriesRow:
public static AllCountriesRow valueOf(BytesRefArrayWritable braw) throws IOException
{
    try
    {
        StructObjectInspector soi = (StructObjectInspector) serDe.getObjectInspector();
        Object row = serDe.deserialize(braw);
        List<? extends StructField> fieldRefs = soi.getAllStructFieldRefs();

        Object fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.ID.ordinal()));
        ObjectInspector oi = fieldRefs.get(Column.ID.ordinal()).getFieldObjectInspector();
        int id = ((IntObjectInspector) oi).get(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.NAME.ordinal()));
        oi = fieldRefs.get(Column.NAME.ordinal()).getFieldObjectInspector();
        String name = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CODE.ordinal()));
        oi = fieldRefs.get(Column.CODE.ordinal()).getFieldObjectInspector();
        String code = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.REGION_NAME.ordinal()));
        oi = fieldRefs.get(Column.REGION_NAME.ordinal()).getFieldObjectInspector();
        String regionName = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CONTINENT_ID.ordinal()));
        oi = fieldRefs.get(Column.CONTINENT_ID.ordinal()).getFieldObjectInspector();
        int continentId = ((IntObjectInspector) oi).get(fieldData);

        return new AllCountriesRow(id, name, code, regionName, continentId);
    }
    catch (SerDeException e)
    {
        throw new IOException(e);
    }
}
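The original goal was to cache these rows in a HashMap for a map-side join. A minimal sketch of that cache, keyed by id; the inner AllCountriesRow class and its getId() accessor here are hypothetical stand-ins for the real class deserialized above (which needs the Hadoop/Hive classpath):

```java
import java.util.HashMap;
import java.util.Map;

public class MapSideJoinCache {
    // Simplified stand-in for the AllCountriesRow built by valueOf() above
    // (hypothetical; the real class is deserialized from the RCFile).
    static class AllCountriesRow {
        final int id;
        final String name, code, regionName;
        final int continentId;

        AllCountriesRow(int id, String name, String code,
                        String regionName, int continentId) {
            this.id = id;
            this.name = name;
            this.code = code;
            this.regionName = regionName;
            this.continentId = continentId;
        }

        int getId() { return id; }
    }

    public static void main(String[] args) {
        // In the real job this map would be filled inside the
        // while (reader.next(key, value)) loop shown earlier.
        Map<Integer, AllCountriesRow> countriesById = new HashMap<>();
        AllCountriesRow acr = new AllCountriesRow(191, "USA", "USA", "Americas", 19);
        countriesById.put(acr.getId(), acr);

        // Map-side join: probe the cache with the join key of each record.
        AllCountriesRow match = countriesById.get(191);
        System.out.println(match.name); // prints "USA"
    }
}
```

Keying by id matches the example output above; any other field could serve as the key, depending on which column the join is on.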