Filter based on count from a GROUP BY on multiple columns and get the original dataset

Tags: filter, count, apache-pig, multiple-columns

Scenario

2, cornflakes, Regular, General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14  
4, chocolate syrup, Regular, Hersheys, 5   
5, chocolate syrup, No High Fructose, Hersheys, 8  
6, chocolate syrup, Regular, Ghirardeli, 6  
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

filter_data will give you (chocolate syrup, Regular). Join filter_data with the original dataset on the key (item, type) to get the desired result:

4, chocolate syrup, Regular, Hersheys, 5
6, chocolate syrup, Regular, Ghirardeli, 6

Man, write your question clearly...

Could not generate a reasonable plan. Nested exception: org.apache.pig.backend.executionengine.ExecutionException: ERROR 1070: Could not resolve count using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

COUNT should be uppercase.

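-- Assumes data is already loaded with a five-column schema that names item and type.
-- Count the rows in each (item, type) group.
data_grp = GROUP data BY (item, type);
data_cnt = FOREACH data_grp GENERATE FLATTEN(group) AS (item, type), COUNT(data) AS total;
-- Keep only the pairs that occur more than once, which is what yields rows 4 and 6.
filter_data = FILTER data_cnt BY total > 1;
-- Join back to the original rows and keep only the original five columns.
o_data = JOIN data BY (item, type), filter_data BY ($0, $1);
final_data = FOREACH o_data GENERATE $0..$4;
DUMP final_data;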
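
To sanity-check the group, count, filter, and join logic without a Pig installation, the same steps can be sketched in plain Python on the sample rows from the scenario. This is only an illustration, not part of the original answer: the function name rows_in_repeated_groups and the column order (id, item, type, brand, price) are assumptions, and the filter keeps pairs that occur more than once because that is what the expected output (rows 4 and 6) implies.

```python
from collections import Counter

# Sample rows from the scenario. The column order (id, item, type, brand, price)
# is an assumption; only item and type are named in the Pig script.
data = [
    (2, "cornflakes", "Regular", "General Mills", 12),
    (3, "cornflakes", "Mixed Nuts", "Post", 14),
    (4, "chocolate syrup", "Regular", "Hersheys", 5),
    (5, "chocolate syrup", "No High Fructose", "Hersheys", 8),
    (6, "chocolate syrup", "Regular", "Ghirardeli", 6),
    (7, "chocolate syrup", "Strawberry Flavor", "Ghirardeli", 7),
]


def rows_in_repeated_groups(rows):
    """Return the original rows whose (item, type) pair occurs more than once."""
    # GROUP BY (item, type) and COUNT the rows in each group.
    totals = Counter((row[1], row[2]) for row in rows)
    # FILTER: keep only the keys that were seen more than once.
    repeated = {key for key, total in totals.items() if total > 1}
    # JOIN back to the original rows on (item, type), preserving input order.
    return [row for row in rows if (row[1], row[2]) in repeated]


final_data = rows_in_repeated_groups(data)
for row in final_data:
    print(row)
```

On the sample data only the (chocolate syrup, Regular) pair appears twice, so the rows with ids 4 and 6 are the ones that survive the join.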