Counting task in Pig Latin (Apache Pig)


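Suppose I have a list of couples (id, value) and a list of potential ids. For each potentialId, I want to count the number of times that id appears in the first list.

I tried to do this in Pig Latin, but it does not seem trivial. Could you give me some hints?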
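The overall plan is: group couples by id and COUNT each group, then do a LEFT join of potentialIDs with those counts and output the count for each potential id. From there you can format the result however you need. The code below shows how to do this in more detail.

Note: let me know if you need me to go into more detail, but I think the comments explain what is going on well enough.
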
-- B holds the number of occurrences of each id in couples
B = FOREACH (GROUP couples BY id) 
    -- Output and schema of the group is:
    -- {group: chararray,couples: {(id: chararray,value: chararray)}}
    -- (1,{(1,a),(1,x)})
    -- (2,{(2,y)})

    -- COUNT(couples) counts the number of tuples in the bag
    GENERATE group AS id, COUNT(couples) AS count ;

-- Now we want to do a LEFT join on potentialIDs and B since it will
-- create nulls for IDs that appear in potentialIDs, but not in B
C = FOREACH (JOIN potentialIDs BY id LEFT, B BY id) 
    -- The output and schema for the join is:
    -- {potentialIDs::id: chararray,B::id: chararray,B::count: long}
    -- (1,1,2)
    -- (2,2,1)
    -- (3,,)

    -- Now we pull out only one ID, and convert any NULLs in count to 0s
    GENERATE potentialIDs::id, (B::count is NULL?0:B::count) AS count ;

The schema and output of C are:
C: {potentialIDs::id: chararray,count: long}
(1,2)
(2,1)
(3,0)

If you do not want the disambiguation prefix (the ::) in the schema of C, just change the GENERATE line to:
GENERATE potentialIDs::id AS id, (B::count is NULL?0:B::count) AS count ;
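
The script above assumes that the relations couples and potentialIDs already exist. As a minimal sketch of how they could be defined, assuming tab-delimited input files (the paths couples.txt and potential_ids.txt are hypothetical and not part of the original question), something like this would go before the GROUP and JOIN statements:

```pig
-- Hypothetical input paths; replace them with the location of your own data.
-- Each line of couples.txt is assumed to hold "id<TAB>value".
couples = LOAD 'couples.txt' USING PigStorage('\t')
          AS (id:chararray, value:chararray);

-- Each line of potential_ids.txt is assumed to hold a single id.
potentialIDs = LOAD 'potential_ids.txt' USING PigStorage('\t')
               AS (id:chararray);

-- Print the declared schemas to check they match what the script expects.
DESCRIBE couples;
DESCRIBE potentialIDs;
```

Both id fields are declared as chararray so that they match the schemas shown in the comments of the script. Once C has been defined, DUMP C; prints its tuples.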
