Hadoop: counting within a group in Pig
Suppose I have a relation Students with the fields grade and teacher. I want to group by grade and teacher, but keep, for each group, the total number of students in that group's grade. For example:
classes = GROUP Students BY (grade,teacher);
classes = FOREACH classes {
GENERATE
(### COUNT OF ALL STUDENTS IN GRADE ###) as grade_size,
Students as students,
teacher as teacher;
}
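But I don't know how to filter from inside the GROUP statement. I would need some kind of filter, but I don't know how to tell the grade of the students inside the group apart from the grade of the students outside it.

There are two ways:

1) Group by grade and teacher, count, then flatten and group again by grade: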
classes = GROUP Students BY (grade,teacher);
teachers = FOREACH classes GENERATE FLATTEN(group) as (grade,teacher), COUNT(Students) as perTeacher;
grade = GROUP teachers BY grade;
result = FOREACH grade GENERATE FLATTEN(teachers), SUM(teachers.perTeacher) as perGrade;
describe result;
dump result;
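If the bag of students for each (grade, teacher) pair also has to be kept, as the question asks, the same idea can be applied in the opposite order: attach the grade size to every student first, then group by grade and teacher. This is only a minimal sketch, not a tested solution. It assumes a hypothetical input file students.tsv and a hypothetical name field, neither of which appears in the question.

```pig
-- Assumed schema and path; only grade and teacher are given in the question.
Students = LOAD 'students.tsv' AS (name:chararray, grade:int, teacher:chararray);

-- Count all students in each grade and attach that count to every student row.
by_grade = GROUP Students BY grade;
with_size = FOREACH by_grade GENERATE
    FLATTEN(Students) AS (name, grade, teacher),
    COUNT(Students) AS grade_size;

-- Group by grade and teacher; grade_size is constant within a grade,
-- so adding it to the key does not split any group.
classes = GROUP with_size BY (grade, teacher, grade_size);
result = FOREACH classes GENERATE
    FLATTEN(group) AS (grade, teacher, grade_size),
    with_size.name AS students;
```

Putting grade_size into the grouping key avoids a join, at the cost of one extra GROUP pass over the data.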
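2) Group by grade only, then use a UDF from the DataFu library to do the per-teacher grouping in memory. This is faster, but it is prone to heap memory exceptions.

Removed the sql tag, since this is about Pig.

Sample input and output would help in understanding your question.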