How do I join maps in Apache Pig? (stored in HBase)


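I have a problem with Apache Pig that I don't know how to solve, or whether it can be solved at all. I'm using HBase as the "storage layer". The table looks like this: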

row key/column  (b1, c1)        (b2, c2)    ...     (bn, cn)
a1              empty           empty               empty   
a2              ...
an              ...         
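There are row keys a1 to an, and each row has a different set of columns, named in the form (bn, cn). The value of every row/column cell is empty.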

My Pig program looks like this:

/* Loading the data */
mydata = load 'hbase://mytable' ... as (a:chararray, b_c:map[]);

/* finding the right elements */ 
sub1 = FILTER mydata BY a == 'a1';
sub2 = FILTER mydata BY a == 'a2';

Now I want to join sub1 and sub2, meaning I want to find the columns that exist in both sub1 and sub2. How can I do that?

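Pure Pig cannot do anything like this with maps, so you will need a UDF. I'm not sure what you want as the output of the join, but it should be fairly easy to adapt this Python UDF to your needs.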

myudf.py

@outputSchema('cols: {(col:chararray)}')
def join_maps(M1, M2):
    # Return the names of all columns that exist, with non-null values, in both maps.
    out = []
    for k, v in M1.iteritems():
        if k in M2 and v is not None and M2[k] is not None:
            # Pig bags hold tuples, so wrap each key in a one-field tuple.
            out.append((k,))
    return out
You can use it like this:

register 'myudf.py' using jython as myudf ;

-- sub2 can be referenced as a scalar inside this FOREACH because it has only one row
D = FOREACH sub1 GENERATE myudf.join_maps(b_c, sub2.b_c) ;
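
The intersection logic can be checked outside Pig before the UDF is registered. The following is a minimal sketch in plain Python, assuming the two maps arrive as ordinary dictionaries. The name `common_columns`, the `require_values` flag, and the sample keys are hypothetical and only for illustration. The flag is there because the table in the question stores only empty values: if those cells reach the UDF as None (an assumption worth verifying in your setup), a check like the one in join_maps would filter out every column.

```python
def common_columns(m1, m2, require_values=False):
    """Return the sorted keys that are present in both maps.

    With require_values=True a key only counts when neither map holds
    None for it, which mirrors the check done in join_maps.
    """
    if m1 is None or m2 is None:
        return []
    out = []
    for key in m1:
        if key not in m2:
            continue
        if require_values and (m1[key] is None or m2[key] is None):
            continue
        out.append(key)
    return sorted(out)


# Hypothetical rows: column names as keys, empty cells modelled as None.
row_a1 = {"b1:c1": None, "b2:c2": None, "b3:c3": None}
row_a2 = {"b2:c2": None, "b3:c3": None, "b4:c4": None}

print(common_columns(row_a1, row_a2))                       # ['b2:c2', 'b3:c3']
print(common_columns(row_a1, row_a2, require_values=True))  # []
```

If the second call matches what you see in Pig (an empty result), dropping the None checks from the UDF may be what you need.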

Thanks. I built a MapToBag UDF in Java, and that works well for me. Thanks a lot.
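
The commenter's Java UDF is not shown. As a rough sketch of the same idea, a map-to-bag conversion could also be written as a Jython UDF. The name `map_to_bag`, its schema, and the fallback decorator below are assumptions for illustration, not the commenter's code.

```python
try:
    outputSchema  # expected to be provided by Pig when registered with jython
except NameError:
    def outputSchema(schema):
        # Stand-in so the file also runs under a plain Python interpreter.
        def decorator(func):
            return func
        return decorator


@outputSchema('cols: {(col:chararray)}')
def map_to_bag(m):
    # Emit one single-field tuple per map key; a missing map yields an empty bag.
    if m is None:
        return []
    return [(key,) for key in sorted(m)]


print(map_to_bag({"b2:c2": None, "b1:c1": None}))  # [('b1:c1',), ('b2:c2',)]
```

Once each map is a bag of column names, the bags for sub1 and sub2 could be flattened and the two relations joined on col using Pig's own operators, which avoids passing one relation into the other as a scalar.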