Apache Pig: normalizing data in a Pig script

I have the following dataset:

1,11,ab;cd;200

2,22,pq;rs

I want the output to contain the following:

1,11,ab

1,11,cd

1,11,200

2,22,pq

2,22,rs

How can I do this in Pig without using any custom UDFs?

You can do the following:

A = load '....' using PigStorage(',') as (x,y,data : chararray);
SPLT = foreach A generate x, y, FLATTEN(STRSPLIT(data,';'));
X_tmp = foreach SPLT generate $0 as x, $1 as y, FLATTEN(TOBAG($2..$20)) as term; -- pivots the row
X = filter X_tmp by term is not null; -- removes the extra null rows left over when data was split into fewer terms than the range covers

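This assumes the data string splits into a bounded number of elements: the range `$2..$20` covers 19 fields. If you have more, raise the upper bound.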
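
If the fixed upper bound is a concern, one possible alternative is to let Pig build the bag directly. The following is a minimal sketch, not part of the original answer. It assumes your Pig version supports the optional delimiter argument of the built-in `TOKENIZE`, and it reuses the placeholder path `'data'` and a hypothetical output field name `val`.

```pig
-- Load the comma-separated rows; the third field holds the ';'-separated values.
A = load 'data' using PigStorage(',') as (x, y, data:chararray);

-- TOKENIZE returns a bag with one tuple per token; FLATTEN on a bag
-- emits one output row per token, repeating x and y each time.
-- Assumption: TOKENIZE accepts a delimiter as its second argument.
out = foreach A generate x, y, FLATTEN(TOKENIZE(data, ';')) as val;

dump out;
```

Because the bag grows with the number of tokens, this sketch should need neither a maximum element count nor a null filter. Note that `TOKENIZE` skips empty tokens, so a value such as `ab;;cd` would yield two rows rather than three.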

Try this:

    A = load 'data' using PigStorage(',') as (x,y,data:chararray);
    SPLT = foreach A generate x, y, FLATTEN(STRSPLIT(data,';',3)) as (a,b,c);
    grp = group SPLT by (x,y);
    res = foreach grp generate group, FLATTEN(SPLT);
    out = foreach res generate FLATTEN(group), FLATTEN(TOBAG(SPLT::a, SPLT::b, SPLT::c)) as val;
    dump out;