Pig Java UDF that parses a string by position and returns a bag of substrings with generated keys


java apache-pig


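I have a column that is a string field. I need to read this string in, store it in a bag, and give each part a key (so that it is unique when I store it as JSON).

My sample data file is: "test, kyle"

I want my output to look like: {"test":[{"key":"test"},{"value":"kyle"}]}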
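To make the target shape concrete outside of Pig, here is a minimal plain-Java sketch of the positional parsing, assuming the sample line is `test, kyle` and that the first part is the key and the second the value. The class name `PositionalKeyValue` and its methods are hypothetical, and the JSON rendering does no escaping, so it only illustrates the intended structure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PositionalKeyValue {
    // Names assigned by position; taken from the desired output in the question.
    private static final String[] FIELD_NAMES = {"key", "value"};

    /** Splits a line such as "test, kyle" on commas and labels each part by position. */
    public static Map<String, String> parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "expected " + FIELD_NAMES.length + " comma-separated fields, got " + parts.length);
        }
        // LinkedHashMap keeps insertion order, so "key" always precedes "value".
        Map<String, String> labeled = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            labeled.put(FIELD_NAMES[i], parts[i].trim());
        }
        return labeled;
    }

    /** Renders the labeled parts in the nested shape from the question (no JSON escaping). */
    public static String toJson(Map<String, String> labeled) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"").append(labeled.get("key")).append("\":[");
        boolean first = true;
        for (Map.Entry<String, String> e : labeled.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append("{\"").append(e.getKey()).append("\":\"").append(e.getValue()).append("\"}");
            first = false;
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        // prints {"test":[{"key":"test"},{"value":"kyle"}]}
        System.out.println(toJson(parse("test, kyle")));
    }
}
```

The names are fixed by position in a constant array rather than handed out by a counter, so parsing the same line twice always yields the same labels.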

import java.io.IOException;
import java.util.Collections;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class BagTupleExampleUDF extends EvalFunc<DataBag> {

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    int counter = 0;
    static String fieldNames[] = {"key", "value"};

    @Override
    public DataBag exec(Tuple tuple) throws IOException {
        // expect two string fields: the key and the value
        if (tuple == null || tuple.size() < 2) {
            throw new IllegalArgumentException("BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String key = (String) tuple.get(0);
            String input = (String) tuple.get(1);

            DataBag output = mBagFactory.newDefaultBag();
            // each value goes into its own single-field tuple
            output.add(mTupleFactory.newTuple(Collections.singletonList(key)));
            output.add(mTupleFactory.newTuple(Collections.singletonList(input)));
            return output;
        }
        catch (Exception e) {
            throw new IOException("BagTupleExampleUDF: caught exception processing input.", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        // Function returns a bag with this schema: { (chararray), (chararray) }
        // Thus the outputSchema type should be a Bag containing a chararray
        try {
            Schema bagSchema = new Schema();
            String schemaName = getSchemaName(this.getClass().getName().toLowerCase(), input);

            // takes the next name from fieldNames; counter advances on every call
            bagSchema.add(new Schema.FieldSchema(fieldNames[counter], DataType.CHARARRAY));
            counter++;
            return new Schema(new Schema.FieldSchema(schemaName, bagSchema, DataType.BAG));
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

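Basically, each value I read in from the tuple needs its own identifying key, so that once Pig finishes I can reference the new columns I have added.

I am still quite new to Pig, and to UDFs in particular, so please let me know if more information is needed.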

bagSchema.add(new Schema.FieldSchema(fieldNames[counter], DataType.CHARARRAY));
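The schema line above takes its field name from a counter. Pig may call `outputSchema` more than once while planning a script, so a name chosen that way can change between calls or run past the end of the array. All tuples in a bag also share one schema, so two one-field tuples cannot carry different names. One possible alternative, sketched here under those assumptions, is to put both values in a single tuple with two named fields and declare that tuple schema up front. The class name `KeyValueBagUDF` and the aliases `pair` and `pairs` are hypothetical. The result is a bag holding one `(key, value)` tuple, which is a different shape from the two-tuple bag in the question.

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class KeyValueBagUDF extends EvalFunc<DataBag> {

    private static final TupleFactory TUPLES = TupleFactory.getInstance();
    private static final BagFactory BAGS = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        // Assumes two input fields: position 0 is the key, position 1 is the value.
        if (input == null || input.size() < 2) {
            return null;
        }
        Tuple pair = TUPLES.newTuple(2);
        pair.set(0, input.get(0));
        pair.set(1, input.get(1));

        DataBag bag = BAGS.newDefaultBag();
        bag.add(pair);
        return bag;
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            // Both field names are declared on every call, so no state is kept between calls.
            Schema tupleSchema = new Schema();
            tupleSchema.add(new Schema.FieldSchema("key", DataType.CHARARRAY));
            tupleSchema.add(new Schema.FieldSchema("value", DataType.CHARARRAY));

            // bag { pair: (key: chararray, value: chararray) }
            Schema bagSchema = new Schema(
                new Schema.FieldSchema("pair", tupleSchema, DataType.TUPLE));
            return new Schema(new Schema.FieldSchema("pairs", bagSchema, DataType.BAG));
        } catch (FrontendException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With a schema like this, the `key` and `value` fields should be addressable by name in later Pig statements, and the declared schema stays the same however many times `outputSchema` is called.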