Pig Java UDF: parse a string by position and return a bag of substrings with generated keys
I have a column that is a string field. I need to read this string in, store it in a bag, and give it a key (so that it is unique when I store it as JSON).
My sample data file is:
test,kyle
I would like my output to look like:
{"test":[{"key":"test"},{"value":"kyle"}]}
This is the UDF I have so far:
import java.io.IOException;
import java.util.Collections;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class BagTupleExampleUDF extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    int counter = 0;
    static String[] fieldNames = {"key", "value"};

    @Override
    public DataBag exec(Tuple tuple) throws IOException {
        // Expects two fields: the key at position 0 and the value at position 1.
        if (tuple == null || tuple.size() < 2) {
            throw new IllegalArgumentException("BagTupleExampleUDF: requires input parameters.");
        }
        try {
            String key = (String) tuple.get(0);
            String input = (String) tuple.get(1);
            // Each string goes into its own single-field tuple inside the bag.
            DataBag output = mBagFactory.newDefaultBag();
            output.add(mTupleFactory.newTuple(Collections.singletonList(key)));
            output.add(mTupleFactory.newTuple(Collections.singletonList(input)));
            return output;
        }
        catch (Exception e) {
            throw new IOException("BagTupleExampleUDF: caught exception processing input.", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        // Function returns a bag of single-field chararray tuples: { (chararray), (chararray) }
        // Each call adds one field named fieldNames[counter], then advances the counter.
        try {
            Schema bagSchema = new Schema();
            String schemaName = getSchemaName(this.getClass().getName().toLowerCase(), input);
            bagSchema.add(new Schema.FieldSchema(fieldNames[counter], DataType.CHARARRAY));
            counter++;
            return new Schema(new Schema.FieldSchema(schemaName, bagSchema, DataType.BAG));
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
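Basically, every value I read in from the tuple needs a different identifying key, so that once Pig has finished I can refer to the new columns I added.
I am still very new to Pig, and especially to UDFs, so please let me know if more information is needed.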
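To make the "different key per value" requirement concrete, here is a minimal sketch in plain Java, deliberately without any Pig classes so that it runs on its own. It takes each name from the position of the value inside a loop rather than from a counter stored in a field, because a stored counter depends on how many times the method happens to be called. The class name `KeyedFieldsSketch` and the methods `label` and `toJson` are hypothetical names made up for this illustration. It assumes exactly two comma-separated pieces and does no JSON escaping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Pig or of the UDF above.
class KeyedFieldsSketch {
    // Names assigned by position, mirroring fieldNames in the UDF.
    static final String[] FIELD_NAMES = {"key", "value"};

    /** Splits a comma-separated line and pairs each piece with its positional name. */
    static List<Map<String, String>> label(String line) {
        String[] parts = line.split(",", -1);
        if (parts.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "expected " + FIELD_NAMES.length + " fields but got " + parts.length);
        }
        List<Map<String, String>> out = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            // The name comes from the loop index, not from mutable state.
            out.add(Collections.singletonMap(FIELD_NAMES[i], parts[i].trim()));
        }
        return out;
    }

    /** Renders the labelled pieces in the JSON shape from the question (no escaping). */
    static String toJson(String line) {
        List<Map<String, String>> fields = label(line);
        // The first piece doubles as the outer JSON key.
        String outerKey = fields.get(0).get(FIELD_NAMES[0]);
        StringBuilder sb = new StringBuilder();
        sb.append("{\"").append(outerKey).append("\":[");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            Map.Entry<String, String> e = fields.get(i).entrySet().iterator().next();
            sb.append("{\"").append(e.getKey()).append("\":\"")
              .append(e.getValue()).append("\"}");
        }
        sb.append("]}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints {"test":[{"key":"test"},{"value":"kyle"}]}
        System.out.println(toJson("test,kyle"));
    }
}
```

Inside the UDF, the same index-based loop could presumably be used in outputSchema to add one Schema.FieldSchema per entry of fieldNames, although that part is not shown or tested here.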