Apache Pig: filtering on a column from another relation in Pig

Suppose I have the following data in Pig:

DUMP raw;
(2015-09-15T22:11:00.000-07:00,1)
(2015-09-15T22:12:00.000-07:00,2)
(2015-09-15T23:11:00.000-07:00,3)
(2015-09-16T21:02:00.000-07:00,4)
(2015-09-15T00:02:00.000-07:00,5)
(2015-09-17T08:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,5)
(2015-09-17T09:02:00.000-07:00,1)
(2015-09-17T19:02:00.000-07:00,1)

DESCRIBE raw;
raw: {process_date: chararray,id: int}

A = GROUP raw BY id;
DESCRIBE A;
A: {group: int,raw: {(process_date: chararray,id: int)}}
DUMP A;
(1,{(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
(2,{(2015-09-15T22:12:00.000-07:00,2)})
(3,{(2015-09-15T23:11:00.000-07:00,3)})
(4,{(2015-09-16T21:02:00.000-07:00,4)})
(5,{(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})

B = FOREACH A { GENERATE raw, MAX(raw.process_date) AS max_date; }
DUMP B;
({(2015-09-15T22:11:00.000-07:00,1),(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)},2015-09-17T19:02:00.000-07:00)
({(2015-09-15T22:12:00.000-07:00,2)},2015-09-15T22:12:00.000-07:00)
({(2015-09-15T23:11:00.000-07:00,3)},2015-09-15T23:11:00.000-07:00)
({(2015-09-16T21:02:00.000-07:00,4)},2015-09-16T21:02:00.000-07:00)
({(2015-09-15T00:02:00.000-07:00,5),(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)},2015-09-17T09:02:00.000-07:00)

DESCRIBE B;
B: {raw: {(process_date: chararray,id: int)},max_date: chararray}

Now I need to filter the raw data on process_date eq max_date, comparing only the date part. I tried the following:

C = FOREACH B {filtered = FILTER raw BY REGEX_EXTRACT(process_date,'(\\d{4}-\\d{2}-\\d{2})',1) eq REGEX_EXTRACT(max_date,'(\\d{4}-\\d{2}-\\d{2})',1)}

but it's not working. Is there a way to do this kind of filtering? Basically, I need to filter the raw data by the latest date. The exception I get is:

Invalid field projection. Projected field [max_date] does not exist in schema: process_date:chararray,id:int
Expected output: for each id, the records from its latest date (matching on the date only, not the time):

({(2015-09-17T09:02:00.000-07:00,1),(2015-09-17T19:02:00.000-07:00,1)})
({(2015-09-15T22:12:00.000-07:00,2)})
({(2015-09-15T23:11:00.000-07:00,3)})
({(2015-09-16T21:02:00.000-07:00,4)})
({(2015-09-17T08:02:00.000-07:00,5),(2015-09-17T09:02:00.000-07:00,5)})
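
The exception suggests that the nested FILTER resolves field names only against the schema of the tuples inside the raw bag (process_date, id), so max_date is not visible there. One possible workaround, offered as a sketch rather than a definitive answer, is to flatten the bag so that max_date sits on every row, filter at the top level, and then regroup. The aliases with_max, latest, regrouped and result are illustrative names that do not appear in the question, and SUBSTRING(..., 0, 10) is swapped in for REGEX_EXTRACT on the assumption that every timestamp starts with a yyyy-MM-dd date. Comparing the strings with MAX also assumes all rows share the same UTC offset, as they do in the sample data.

```pig
-- Assumes raw is already loaded with schema (process_date: chararray, id: int).
A = GROUP raw BY id;

-- Flatten the bag so each row carries its group's max_date next to its own fields.
with_max = FOREACH A GENERATE
    FLATTEN(raw) AS (process_date: chararray, id: int),
    MAX(raw.process_date) AS max_date;

-- Top-level FILTER can see both columns; compare only the leading yyyy-MM-dd part.
latest = FILTER with_max BY SUBSTRING(process_date, 0, 10) == SUBSTRING(max_date, 0, 10);

-- Regroup to get one bag of (process_date, id) tuples per id.
regrouped = GROUP latest BY id;
result = FOREACH regrouped GENERATE latest.(process_date, id);

DUMP result;
```

On the sample data this is intended to produce one bag per id containing only the rows from that id's latest date, in line with the expected output above. The order of tuples within each bag is not guaranteed.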

Comments:

You have already used C as a relation; try using a different name.

That is not the problem. I have tried other names as well and it does not work. I can use C anyway, because I first use C inside the nested FOREACH and it is not exposed to the outer FOREACH. C could even be used outside. The problem is using a field from another relation in the FILTER clause.

Can you show us your desired output?

The desired output is something like ({(2015-09-17T00:02:00.000-07:00,5)}): as I said earlier, the raw data filtered by max_date.

@Ravi: Is your goal to select the latest records based on date?