Apache pig 我可以使用Pig拉丁语中的嵌套FOREACH语句生成嵌套包吗?

Apache pig 我可以使用Pig拉丁语中的嵌套FOREACH语句生成嵌套包吗?,apache-pig,Apache Pig,假设我有一组餐厅评论数据: User,City,Restaurant,Rating Jim,New York,Mecurials,3 Jim,New York,Whapme,4.5 Jim,London,Pint Size,2 Lisa,London,Pint Size,4 Lisa,London,Rabbit Whole,3.5 我想按用户和平均评论城市列出一个列表。即输出: User,City,AverageRating Jim,New York,3.75 Jim,London,2 Lis

假设我有一组餐厅评论数据:

User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5
我想按用户和平均评论城市列出一个列表。即输出:

User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75
我可以写一个猪脚本如下:

Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:charray, rating:float
);

PerUserCity = GROUP Data BY (user, city);

ResultSet = FOREACH PerUserCity {
    GENERATE group.user, group.city, AVG(Data.rating);
}
然而,我很好奇,我是否可以先将较高级别的组(用户)分组,然后再将下一级别(城市)分组:即

PerUser = GROUP Data BY user;

Intermediate = FOREACH PerUser {
    B = GROUP Data BY city;
    GENERATE group AS user, B;
}
我得到:

Error during parsing.
Invalid alias: GROUP in {
  group: chararray,
  Data: {
    user: chararray,
    city: chararray,
    restaurant: chararray,
    rating: float
  }
}
有人成功地尝试过这个吗?在一个FOREACH中分组是不可能的吗

我的目标是做如下事情:

ResultSet = FOREACH PerUser {
    FOREACH City {
        GENERATE user, city, AVG(City.rating)
    }
}

当前允许的操作有
DISTINCT
FILTER
LIMIT
ORDER BY


目前,直接按(用户、城市)分组是按您所说的方式进行分组的好方法。

Pig版本0.10的发行说明建议嵌套的FOREACH操作是。

尝试以下操作:

Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
avgRating_Byuser_perCity = foreach grpRecs generate AVG(Records.rating) as average; 
Result = foreach avgRating_Byuser_perCity generate flatten(group), average;

通过两个键进行分组,然后展平结构,可以得到相同的结果:

像你那样加载数据

Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:charray, rating:float);
按用户和城市分组

 ByUserByCity = GROUP Data BY (user, city);
添加组的平均评分(您可以添加更多,如COUNT(数据)作为COUNT\u res) 然后将组结构展平到原始组结构

ByUserByCityAvg = FOREACH ByUserByCity GENERATE
FLATTEN(group) AS (user, city),
AVG(Data.rating) as user_city_avg;
结果:

Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
User,City,

非常感谢。为什么在内部块中需要两个生成?收回我的建议。发行说明建议可以这样做,但我不能让它工作。你应该添加一个描述这段代码完成了什么以及它是如何做到的。这是错误的。。。它应该是Records=load'data_rating.txt',使用PigStorage(',')作为(用户:chararray,城市:chararray,餐厅:chararray,评级:float);grpRecs=按(用户、城市)分组的记录;avgRating_Byuser_perCity=每个GRPREC生成扁平化(组),AVG(记录.评级)作为平均值;结果=用户百分比的转储平均值;我想,这并不能回答这个问题
Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
User,City,