Advanced pivot tables in Apache Pig
I've checked the two other Pig pivot questions on StackOverflow with no luck. This one is a bit different: I want to write a generic pivot function where I don't know the schema up front. Worse, I need to pivot on an arbitrary number of columns and generate new columns, similar to how an Excel pivot table works. For example:
user year make model mileage
=======================================
123 2011 Ford Taurus 19.2
123 2011 Subaru Forester 23.9
123 2012 Nissan Altima 25.6
123 2013 Ford Taurus 21.8
Say in this example I want to pivot on user ID and year:
user year Ford_Taurus_mileage Subaru_Forester_mileage Nissan_Altima_mileage
=================================================================================
123 2011 19.2 23.9
123 2012 25.6
123 2013 21.8
The equivalent Excel configuration would be two row labels, "user" and "year", one value column, "mileage", and two column labels, "make" and "model".
I'm starting to think this may not be possible in Pig, but I figured I'd post here just in case. I've considered having the user supply all the columns up front to a UDF so the schema can be built, but even then, how would I merge all the rows together? For 2011, for instance, we merge from two rows down to one.
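To make the target transformation concrete, here is a plain-Python sketch of the merge I'm after (pure illustration; the function and variable names are mine, not anything in Pig):

```python
from collections import defaultdict

def pivot(rows, row_keys, col_keys, value_key):
    """Excel-style pivot over a list of dicts: row_keys label the output
    rows; each distinct col_keys combination becomes a new column named
    after it plus the value column."""
    columns = []
    merged = defaultdict(dict)
    for r in rows:
        name = "_".join(str(r[k]) for k in col_keys) + "_" + value_key
        if name not in columns:
            columns.append(name)
        key = tuple(r[k] for k in row_keys)
        merged[key][name] = r[value_key]  # rows sharing (user, year) collapse here
    return columns, dict(merged)

rows = [
    {"user": 123, "year": 2011, "make": "Ford",   "model": "Taurus",   "mileage": 19.2},
    {"user": 123, "year": 2011, "make": "Subaru", "model": "Forester", "mileage": 23.9},
    {"user": 123, "year": 2012, "make": "Nissan", "model": "Altima",   "mileage": 25.6},
    {"user": 123, "year": 2013, "make": "Ford",   "model": "Taurus",   "mileage": 21.8},
]
columns, table = pivot(rows, ["user", "year"], ["make", "model"], "mileage")
# columns → ['Ford_Taurus_mileage', 'Subaru_Forester_mileage', 'Nissan_Altima_mileage']
```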
Any help is greatly appreciated. Thanks.

Aesthetic doubts aside, this is indeed possible. Pig doesn't know the names of all the distinct values your model and make can take, so you have to do the analysis and extract the levels of those variables yourself.

This script handles n models/makes and produces the type of output you asked for. To run it, type pig -x local pivot.py (or whatever you decide to name the file, if not pivot.py) with the sample data in the same directory:
import collections
from org.apache.pig.scripting import *
input_path = 'tmp.txt' #Set to whatever your input path and filename are
#First, we run an embedded job to find all the distinct levels of model and make
find_distincts = """
A = LOAD '$INPUT' USING PigStorage() AS (user:chararray
, year:chararray
, make:chararray
, model:chararray
, mileage:chararray);
B = FOREACH A GENERATE make, model;
C = DISTINCT B;
DUMP C;
"""
P = Pig.compile(find_distincts)
output = P.bind({'INPUT':input_path}).runSingle()
#Gather the models and makes from the output of the Pig script
cars = []
CarRecord = collections.namedtuple('CarRecord', 'make model')
for x in output.result("C").iterator():
    cars.append(CarRecord(make=x.get(0), model=x.get(1)))
#Next, we create a series of conditionals based off these distinct values
pivot_str = ""
cut_str = ""
#List of filters
for car in cars:
    cut_str += "%s_%s_cut" % car + "= FOREACH A GENERATE (make == '%s' AND model == '%s'" % car + "?mileage:0) AS mileage;"
#Output schema for rows we grouped by
pivot_str += "GENERATE FLATTEN(group.user) AS user, FLATTEN(group.year) AS year"
#Output schema for columns we grouped by
for car in cars:
    pivot_str += ', FLATTEN(%s_%s_cut.mileage)' % car + ' AS %s_%s_mileage' % car
pivot_str += ';'
#If you stopped the script here, it almost works--
#this approach yields duplicate records, so we have to enact a DISTINCT.
#It also produces every element of a (user,year) set, not just the
#intersection. To solve this, I sum the rows and keep only the greatest row.
sum_str = 'FOREACH C GENERATE user.., (%s_%s_mileage' % cars[0]
for car in cars[1:]:
    sum_str += ' + %s_%s_mileage' % car
sum_str += ') AS user_year_sum;'
car_str = "%s_%s_mileage" % cars[0]
for car in cars[1:]:
    car_str += ", %s_%s_mileage" % car
car_str += ';'
create_pivot = """
A = LOAD '$INPUT' USING PigStorage() AS (user:chararray
, year:chararray
, make:chararray
, model:chararray
, mileage:float);
B = FOREACH (GROUP A BY (user, year)){
%s
%s
};
C = DISTINCT B;
D = %s
E = GROUP D BY (user, year);
F = FOREACH E GENERATE group.user, group.year, MAX(D.user_year_sum) AS greatest;
G = JOIN F BY (user, year, greatest), D BY (user, year, user_year_sum);
out = FOREACH G GENERATE F::user AS user, F::year AS year, %s
rmf pivoted_results;
STORE out INTO 'pivoted_results';
DESCRIBE out;
""" % (cut_str,pivot_str,sum_str,car_str)
print create_pivot
create_pivot_P = Pig.compile(create_pivot)
output = create_pivot_P.bind({'INPUT':input_path}).runSingle()
Output using the sample input:
123 2011 19.2 0.0 23.9
123 2012 0.0 25.6 0.0
123 2013 21.8 0.0 0.0
I assumed you'd want the nulls set to zero, since in theory no car should have a mileage of zero.
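As a sanity check on the string assembly, the cut_str loop from the script above can be rerun standalone (ordinary Python, no Pig needed) to inspect the Pig Latin it generates:

```python
import collections

CarRecord = collections.namedtuple('CarRecord', 'make model')
cars = [CarRecord(make='Ford', model='Taurus'),
        CarRecord(make='Subaru', model='Forester')]

# Same concatenation as in the script, isolated so the generated
# Pig Latin can be printed and inspected.
cut_str = ""
for car in cars:
    cut_str += "%s_%s_cut" % car + "= FOREACH A GENERATE (make == '%s' AND model == '%s'" % car + "?mileage:0) AS mileage;"

print(cut_str)
```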
Addendum: besides the Pig documentation I already linked to, another great resource is Alan Gates's Programming Pig, which is available for free online.

I think I've solved this by writing a custom store function. The user provides an Excel-like list of arguments: row labels, column labels, and values. The store function then uses checkSchema to capture the original schema information and store it in the UDF context. Next, when putNext is called, new columns are created from the column labels plus the original value column names. The row labels are simply written out. A set of new column names is maintained; each time a value is written, if the number of new column names has grown, we rewrite the new schema to disk after deleting the old one.

For the values, successive iterations of putNext look like this:
123,2011,19.2
123,2011,,23.9
123,2012,,,25.6
123,2013,21.8,,
And for the schema, this:
user,year,Ford_Taurus_mileage
user,year,Ford_Taurus_mileage,Subaru_Forester_mileage
user,year,Ford_Taurus_mileage,Subaru_Forester_mileage,Nissan_Altima_mileage
user,year,Ford_Taurus_mileage,Subaru_Forester_mileage,Nissan_Altima_mileage
The schema on the last line doesn't change, because the input is another mileage record for a Taurus, for which we already have a column.
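The column-growth bookkeeping described above can be sketched in plain Python (the real implementation is a Java StoreFunc; all the names below are mine):

```python
def store_rows(records, row_labels, col_labels, value_label):
    """Sketch of the StoreFunc's putNext bookkeeping: the schema grows as
    unseen (make, model) combinations arrive, and each row is written
    with one slot per column known so far."""
    schema = list(row_labels)
    lines, schemas = [], []
    for rec in records:
        col = "_".join(rec[l] for l in col_labels) + "_" + value_label
        if col not in schema:
            schema.append(col)  # in the real UDF the new schema is rewritten to disk
        fields = [rec[l] for l in row_labels] + [""] * (len(schema) - len(row_labels))
        fields[schema.index(col)] = rec[value_label]
        lines.append(",".join(fields))
        schemas.append(",".join(schema))
    return lines, schemas

records = [
    {"user": "123", "year": "2011", "make": "Ford",   "model": "Taurus",   "mileage": "19.2"},
    {"user": "123", "year": "2011", "make": "Subaru", "model": "Forester", "mileage": "23.9"},
    {"user": "123", "year": "2012", "make": "Nissan", "model": "Altima",   "mileage": "25.6"},
    {"user": "123", "year": "2013", "make": "Ford",   "model": "Taurus",   "mileage": "21.8"},
]
lines, schemas = store_rows(records, ["user", "year"], ["make", "model"], "mileage")
```

Run on the sample data, this reproduces exactly the four value rows and four schema lines shown above.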
After writing out all the data, we can read it back in using the new schema. Unfortunately, this means the first records written will be missing fields (see the first few lines in the iterations above), so I also had to override the LoadFunc method getNext to call a new applySchema method adapted from PigStorage. This new applySchema method adds null values to a tuple when the number of fields in the tuple doesn't match the number of fields in the schema. For example, in the sample above, the first line the StoreFunc writes is:
(123,2011,19.2)
but the overall schema looks like this:
(user,year,Ford_Taurus_mileage,Subaru_Forester_mileage,Nissan_Altima_mileage)
(123,2011,19.2,,)
which means the first row is missing two fields. Using the new LoadFunc, we append the necessary number of null fields so the tuple looks like this:
(user,year,Ford_Taurus_mileage,Subaru_Forester_mileage,Nissan_Altima_mileage)
(123,2011,19.2,,)
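The padding itself is simple; a minimal Python sketch of the modified applySchema (the actual code lives in the Java LoadFunc):

```python
def apply_schema(tup, schema):
    """Pad a short tuple with nulls until its field count
    matches the final schema's."""
    return tuple(tup) + (None,) * (len(schema) - len(tup))

schema = ("user", "year", "Ford_Taurus_mileage",
          "Subaru_Forester_mileage", "Nissan_Altima_mileage")
padded = apply_schema(("123", "2011", "19.2"), schema)
# → ('123', '2011', '19.2', None, None)
```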
Now all that remains is to group the reloaded data by user and year and take the average of the remaining 3 columns to flatten the data.

Unfortunately, embedded Pig is only available in batch mode. We need this capability interactively in the Grunt shell, so that analysts can pivot whatever relation they wish on arbitrary columns. We don't know up front what those relations will look like or which columns they'll pivot on. I think the way to do this is to write a MapReduce job and then use the MAPREDUCE command, but that means storing all the data in an intermediate location and reading it back, and I'm also not sure I could generate a schema for the load step.
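The group-and-average flatten step can also be sketched in plain Python (in Pig this would be a GROUP plus AVG; the helper below is illustrative only):

```python
from collections import defaultdict

def flatten(rows, n_group, n_cols):
    """Group padded tuples by their first n_group fields and average
    each remaining column, ignoring nulls."""
    groups = defaultdict(lambda: [[] for _ in range(n_cols)])
    for row in rows:
        for i, v in enumerate(row[n_group:]):
            if v is not None:
                groups[tuple(row[:n_group])][i].append(v)
    return {k: tuple(sum(c) / len(c) if c else None for c in cols)
            for k, cols in groups.items()}

rows = [
    ("123", "2011", 19.2, None, None),
    ("123", "2011", None, 23.9, None),
    ("123", "2012", None, None, 25.6),
    ("123", "2013", 21.8, None, None),
]
result = flatten(rows, 2, 3)
# the two 2011 rows collapse into one: (19.2, 23.9, None)
```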