Apache Pig: Can I apply DISTINCT on multiple columns in Pig?

I have a use case where I need to count the distinct values of two fields.

Sample:

x = LOAD 'testdata' using PigStorage('^A') as (a,b,c,d);
y = GROUP x BY a;
z = FOREACH y {
        bc = DISTINCT x.b,x.c;   -- the part in question: DISTINCT over two fields
        dd = DISTINCT x.d;
        GENERATE FLATTEN(group) as (a), COUNT(bc), COUNT(dd);
};

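IMHO there is no simple way to do this (nothing like an aggregate over DISTINCT a in MySQL), so you need to split the relation and compute the two counts separately for each row: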

x = LOAD 'testdata' using PigStorage('^A') as (a,b,c,d);

w1 = FOREACH x GENERATE a, CONCAT(b,c) AS bc;
w2 = FOREACH x GENERATE a, d;

v1 = DISTINCT w1;
v2 = DISTINCT w2;

u1 = GROUP v1 BY a;
u2 = GROUP v2 BY a;

t1 = FOREACH u1 GENERATE group AS a, COUNT(v1.bc);
t2 = FOREACH u2 GENERATE group AS a, COUNT(v2.d);

s = JOIN t1 BY a, t2 BY a;

A UDF could simplify this considerably.
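One caveat with the CONCAT(b,c) key above is worth checking against your data: plain concatenation drops the boundary between the two fields, so different (b, c) pairs can produce the same key. The sketch below illustrates the effect in Python rather than Pig. The sample rows and the separator are hypothetical, and the separator is assumed never to occur in the data.

```python
# Hypothetical sample rows (a, b, c); not taken from the original post.
rows = [
    ("k1", "a", "bc"),
    ("k1", "ab", "c"),
]

# Key built by plain concatenation, as CONCAT(b, c) would produce.
concat_keys = {(a, b + c) for a, b, c in rows}

# Key that keeps the field boundary, using a separator assumed absent from the data.
SEP = "\x01"
separated_keys = {(a, b + SEP + c) for a, b, c in rows}

print(len(concat_keys))     # 1: the two pairs collapse into one key
print(len(separated_keys))  # 2: the pairs stay distinct
```

If such collisions are possible in your data, a key that preserves the field boundary avoids undercounting.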

You are very close. The key is not to apply DISTINCT to two fields, but to apply it to a single composite field that you create:

x = LOAD 'testdata' using PigStorage('^A') as (a,b,c,d);
x2 = FOREACH x GENERATE a, TOTUPLE(b,c) AS bc, d;
y = GROUP x2 BY a;
z = FOREACH y {
        bc = DISTINCT x2.bc;
        dd = DISTINCT x2.d;
        GENERATE FLATTEN(group) AS (a), COUNT(bc), COUNT(dd);
};
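To sanity-check what the script above should produce, the same logic can be mirrored outside Pig. This is a minimal Python sketch, not Pig code: the sample rows and the function name `distinct_counts` are hypothetical, and Pig's null handling in COUNT is not modeled.

```python
from collections import defaultdict

# Hypothetical rows (a, b, c, d) standing in for 'testdata'.
rows = [
    ("g1", "x", "1", "p"),
    ("g1", "x", "1", "q"),
    ("g1", "x", "2", "p"),
    ("g2", "y", "1", "p"),
]

def distinct_counts(rows):
    """Per value of a, count distinct (b, c) pairs and distinct d values."""
    bc_seen = defaultdict(set)
    d_seen = defaultdict(set)
    for a, b, c, d in rows:
        bc_seen[a].add((b, c))  # composite key, playing the role of TOTUPLE(b,c)
        d_seen[a].add(d)
    return {a: (len(bc_seen[a]), len(d_seen[a])) for a in bc_seen}

result = distinct_counts(rows)
print(result)  # {'g1': (2, 2), 'g2': (1, 1)}
```

Each entry corresponds to one row of z: the group key a, the number of distinct (b, c) pairs, and the number of distinct d values.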

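So we can say that DISTINCT cannot be applied to several columns at once; instead, we create a single composite field, apply DISTINCT to that, and then run the query. Very helpful!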