Apache Pig: FLATTEN and grouping on nested bags

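I'm learning Pig, and I know the answer is probably in a book somewhere, but unfortunately I don't have time to research it. I have two pipelines:

Option 1: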

a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid; 
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
e = foreach d generate flatten(year) as year, event, mpg;
f = group e by year;
g = foreach f generate group, AVG(e.mpg);
x = limit g 10;
dump x;
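I load the two datasets, join them, and take the last two digits of the date to get the year. I then use FLATTEN to simplify things, group by year, and compute the average mpg.

Option 2: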

a = LOAD 'geolocation' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = LOAD 'truck_mileage' USING org.apache.hive.hcatalog.pig.HCatLoader();
c = join a by truckid, b by truckid; 
d = foreach c generate SUBSTRING(rdate,3,5) as year, event, mpg;
f = group d by year;
g = foreach f generate group, AVG(d.mpg);
x = limit g 10;
dump x;
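This is the same thing, except that I group without applying FLATTEN first and then compute the average mpg.

I get the same result from both, but is there a significant difference between them? The dataset I'm using here is small, but I'm curious what would happen with millions of records.

Thanks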

Which of the two would perform significantly differently as the dataset grows?
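
One way to look into this without a large dataset might be to compare the schemas and the execution plans Pig builds for the two pipelines. The following is only a sketch, assuming the relations a through g from the first pipeline are already defined in the same script or Grunt session. DESCRIBE and EXPLAIN are standard Pig operators, but what they print depends on the Pig version and execution engine, so no output is shown here:

```pig
-- Assumes a..g are defined exactly as in the first pipeline (the one with FLATTEN).
DESCRIBE d;   -- schema right after SUBSTRING, before FLATTEN
DESCRIBE e;   -- schema after FLATTEN; compare it with d
EXPLAIN g;    -- logical, physical, and execution plans for the final AVG
```

Running `EXPLAIN g;` for the second pipeline as well and comparing the two plans should show whether the extra FOREACH with FLATTEN adds a separate job or is folded into the same stage as the other projections.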