Apache Pig - Calculation

I have a dataset in Pig that looks like this:

6009544 "NY"    6009545 "NY"
6009544 "NY"    6009545 "NY"
6009548 "NY"    6009546 "OR"
6009546 "OR"    6009546 "OR"
6009545 "NY"    6009546 "OR"
6009548 "NY"    6009547 "AZ"
6009547 "AZ"    6009547 "AZ"
6009547 "AZ"    6009548 "NY"
6009544 "NY"    6009548 "NY"
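The first row reads: "Patent 6009544 originated in NY and cites patent 6009545, which originated in NY." I am trying to find, for each state, the percentage of its citations that point to a patent from the same state. So my expected output should be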

NY: .5
OR: 1
AZ: .5
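This is because six rows have a citing patent from NY, and three of them cite a patent that is also from NY. The one row originating in Oregon cites a patent that is also from Oregon. Of the two rows originating in Arizona, one cites a patent that is also from Arizona.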
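The expected figures can be cross-checked outside Pig. The following is only a minimal Python sketch, assuming the sample rows are whitespace-separated as shown; the names `ROWS` and `same_state_share` are illustrative and not part of the original question.

```python
from collections import Counter

# Sample rows from the question:
# citing patent, its state, cited patent, the cited patent's state.
ROWS = '''\
6009544 "NY" 6009545 "NY"
6009544 "NY" 6009545 "NY"
6009548 "NY" 6009546 "OR"
6009546 "OR" 6009546 "OR"
6009545 "NY" 6009546 "OR"
6009548 "NY" 6009547 "AZ"
6009547 "AZ" 6009547 "AZ"
6009547 "AZ" 6009548 "NY"
6009544 "NY" 6009548 "NY"
'''


def same_state_share(text):
    """Return {state: fraction of its rows that cite a same-state patent}."""
    total = Counter()  # rows per citing state
    same = Counter()   # rows where the cited state equals the citing state
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _citing, state, _cited, cited_state = line.split()
        state = state.strip('"')
        cited_state = cited_state.strip('"')
        total[state] += 1
        if state == cited_state:
            same[state] += 1
    return {state: same[state] / total[state] for state in total}


shares = same_state_share(ROWS)
print(shares)
```

Each row counts as one citation, so a state's share is the number of its same-state rows divided by the number of all its rows.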

Can anyone recommend a good way to do this in Pig?

Can you try this?

input.txt
6009544 "NY"    6009545 "NY"
6009544 "NY"    6009545 "NY"
6009548 "NY"    6009546 "OR"
6009546 "OR"    6009546 "OR"
6009545 "NY"    6009546 "OR"
6009548 "NY"    6009547 "AZ"
6009547 "AZ"    6009547 "AZ"
6009547 "AZ"    6009548 "NY"
6009544 "NY"    6009548 "NY"

PigScript:
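-- Load each row as a single chararray so it can be parsed with a regex.
A = LOAD 'input.txt' AS (line:chararray);
-- Extract: citing patent, its state, cited patent, the cited patent's state (quotes are dropped).
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(\\d+)\\s+"(\\w+)"\\s+(\\d+)\\s+"(\\w+)"')) AS (f1:int,f2:chararray,f3:int,f4:chararray);
-- Group the citations by the citing patent's state.
C = GROUP B BY f2;
-- Per state: same-state citations divided by all citations.
D = FOREACH C {
                FilterByPatent = FILTER B BY f2==f4;
                CityPatentCount = COUNT(B.f2);
                GENERATE group,((float)COUNT(FilterByPatent)/(float)CityPatentCount);
              }
DUMP D;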

Output:
(AZ,0.5)
(NY,0.5)
(OR,1.0)
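To see what the regular expression in the script captures, the same pattern can be tried in Python. This is a sketch only: it assumes the pattern must match the whole line, which is why it uses `re.fullmatch`. The helper name `parse_line` and the choice to return `None` for a non-matching line are illustrative and not taken from the answer.

```python
import re

# Same pattern as in the Pig script, written as a Python raw string
# (Pig needs doubled backslashes inside its quoted string literal).
PATTERN = re.compile(r'(\d+)\s+"(\w+)"\s+(\d+)\s+"(\w+)"')


def parse_line(line):
    """Split one input row into (f1, f2, f3, f4), dropping the quotes."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        return None  # the line does not have the expected shape
    f1, f2, f3, f4 = match.groups()
    return int(f1), f2, int(f3), f4


print(parse_line('6009544 "NY"    6009545 "NY"'))
```

Because the quotes sit outside the capture groups, the state comes out as `NY` rather than `"NY"`, which matches the unquoted group keys in the output above.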

I changed the sample data so that the fields are separated by a single space:

A = load '/padata' using PigStorage(' ') as (pno:int,pcity:chararray,pci:int,pccity:chararray);
b = group A by pcity;
r = foreach b {
               copcity = COUNT(A.pcity);
               samdata = FILTER A by pcity==pccity;
               csamdata = COUNT(samdata);
               percent = (float)csamdata/(float)copcity;
               generate group,percent;
               }
dump r;

Output:
("AZ",0.5)
("NY",0.5)
("OR",1.0)