SQL or Pig query to pivot rows and columns with the same number of rows

I am trying to create a SQL or Pig query that will produce counts of the distinct values, grouped by type.

In other words, given this table:

Type:    Value:
A        x
B        y
C        y
B        y
C        z
A        x
A        z
A        z
A        x
B        x
B        z
B        x
C        x
I would like to get the following result:

Type:    x:    y:    z:
A         3     0     2
B         2     2     1
C         1     1     1
In addition, a table of the averages of those results would also be helpful:

Type:    x:    y:    z:
A         0.60  0.00  0.40
B         0.40  0.40  0.20 
C         0.33  0.33  0.33
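
If the set of distinct values were fixed and known up front (which, per the comments at the end, it is not; there may be 3 values or 900), conditional aggregation in Hive would express this transformation directly. The following is only an illustrative sketch under that assumption, using x, y, z and the same tableName placeholder as the answers below:

-- Counts per type, assuming the distinct values x, y, z are known in advance.
SELECT type,
       SUM(CASE WHEN value = 'x' THEN 1 ELSE 0 END) AS x,
       SUM(CASE WHEN value = 'y' THEN 1 ELSE 0 END) AS y,
       SUM(CASE WHEN value = 'z' THEN 1 ELSE 0 END) AS z
FROM tableName
GROUP BY type;

-- Averages (share of each value within a type), same assumption.
-- Hive's / operator returns a double, so no explicit cast is needed.
SELECT type,
       SUM(CASE WHEN value = 'x' THEN 1 ELSE 0 END) / COUNT(*) AS x,
       SUM(CASE WHEN value = 'y' THEN 1 ELSE 0 END) / COUNT(*) AS y,
       SUM(CASE WHEN value = 'z' THEN 1 ELSE 0 END) / COUNT(*) AS z
FROM tableName
GROUP BY type;
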
Edit 4

I am a Pig newbie, but after reading 8 different Stack Overflow posts I came up with this.

When I use this Pig query:

A = LOAD 'tablex' USING org.apache.hcatalog.pig.HCatLoader();
x = foreach A GENERATE id_orig_h;        -- project the "type" column
xx = distinct x;                         -- distinct types
y = foreach A GENERATE id_resp_h;        -- project the "value" column
yy = distinct y;                         -- distinct values
yyy = group yy all;                      -- all distinct values in a single bag (used for the header row)
zz = GROUP A BY (id_orig_h, id_resp_h);  -- group by (type, value)
B = CROSS xx, yy;                        -- every possible (type, value) combination
C = foreach B generate xx::id_orig_h as id_orig_h, yy::id_resp_h as id_resp_h;
D = foreach zz GENERATE flatten (group) as (id_orig_h, id_resp_h), COUNT(A) as count;  -- counts for the (type, value) pairs that actually occur
E = JOIN C by (id_orig_h, id_resp_h) LEFT OUTER, D BY (id_orig_h, id_resp_h);          -- keep all combinations; missing ones get a null count
F = foreach E generate C::id_orig_h as id_orig_h, C::id_resp_h as id_resp_h, D::count as count;
G = foreach yyy generate 0 as id:chararray, flatten(BagToTuple(yy));                   -- header row listing the values
H = group F by id_orig_h;                -- regroup by type
I = foreach H generate group as id_orig_h, flatten(BagToTuple(F.count)) as count;      -- one row per type with its counts
dump G;
dump I;
This kind of works.

I get:

(0,x,y,z)
(A,3,0,2)
(B,2,2,1)
(C,1,1,1)

I can export this to a text file, strip out the "(" and ")", and use it as a CSV with the schema in the first row. This works, but it is very slow. I would like a better, faster, cleaner way of doing this. If anyone knows of one, please let me know.

The best I can come up with only uses Oracle, and although it will not give you one column per value, it will give you the data like this:

A   x=3,y=3,z=3
B   x=4,y=3
C   y=3,z=2
Of course, if you have 900 values it will show:

A  x=3,y=6,...,ff=12 
and so on.

I cannot add comments, so I could not ask whether Oracle is fine for you. Anyway, the query below does the trick:

SELECT type, values FROM 
(SELECT type, SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','),2) values, seq, 
MAX(seq) OVER (partition by type) max
FROM
(SELECT type, value, OCC, ROW_NUMBER () OVER (partition by type ORDER BY type, value) seq
FROM
(SELECT type, value, COUNT(*) OCC
FROM tableName
GROUP BY type, value))
START WITH seq=1
CONNECT by PRIOR
  seq+1=seq
  AND PRIOR 
    type=type)
WHERE seq = max;
For the averages, which require adding the information before everything else, the code is:

SELECT * FROM 
(SELECT type, 
SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','),2) values,
SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || (OCC / TOT), ','),2) average, 
seq, MAX(seq) OVER (partition by type) max
FROM
(SELECT type, value, TOT, OCC, ROW_NUMBER () OVER (partition by type ORDER BY type, value) seq
FROM
(
SELECT type, value, TOT, COUNT(*) OCC
FROM (SELECT type, value, COUNT(*) OVER (partition by type) TOT
FROM tableName)
GROUP BY type, value, TOT
))
START WITH seq=1
CONNECT by PRIOR
  seq+1=seq
  AND PRIOR 
    type=type)
WHERE seq = max;
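
The comments further down note that SYS_CONNECT_BY_PATH is Oracle-only and that Hive rejects these nested subqueries. Purely as a hedged sketch, a rough Hive analogue of the same string-aggregated output could be built from the standard collect_list and concat_ws built-ins (assuming Hive 0.13+ for collect_list; column and table names as in the query above):

-- Sketch of a Hive equivalent: one row per type, with "value=count" pairs joined by commas.
-- The ordering of the pairs within a row is not guaranteed.
SELECT type,
       concat_ws(',', collect_list(concat(value, '=', cast(occ AS string)))) AS vals
FROM (
  SELECT type, value, COUNT(*) AS occ
  FROM tableName
  GROUP BY type, value
) counted
GROUP BY type;
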

Code updated as per the relevant Edit #3:

A = load '/path/to/input/file' using AvroStorage();
B = group A by (type, value);
C = foreach B generate flatten(group) as (type, value), COUNT(A) as count;

-- Now get all the values.
M = foreach A generate value;

-- Left outer join all the values with C, so that every type has exactly the same number of values associated with it
N = join M by value left outer, C by value;
O = foreach N generate 
                  C::type as type, 
                  M::value as value, 
                  (C::count is null ? 0 : C::count) as count; -- count = 0 means the value was not associated with the type
P = group O by type;
Q = foreach P {
                  R = order O by value asc;  --Ordered by value, so values counts are ordered consistently in all the rows.
                  generate group as type, flatten(R.count);
              }

Please note that I have not executed the code above; these are just representative steps.

You could consider using the vector-manipulation UDFs in Brickhouse. Think of each value as a dimension in a very high-dimensional space, and interpret a single instance of a value as a vector along that dimension with a value of 1. In Hive, we would simply represent such a vector as a map with strings as keys and ints (or other numbers) as values.

What you want to create is a vector that is the sum of all those vectors, grouped by type. The query would be:

SELECT type, 
  union_vector_sum( map( value, 1 ) ) as vector
FROM table
GROUP BY type;
Brickhouse even has a normalization function that will produce your "averages":

SELECT type, 
  vector_normalize(union_vector_sum( map( value, 1 ) ))
     as normalized_vector
FROM table
GROUP BY type;

Comments:

Do you know the list of values?
The first table is a good representation, but the values can be different every day. It is not always x, y, z; there could be 3 values or 900.
Suggestion: please update the question title. It is not clear to others and is not helpful.
Any suggestions for a change?
I am not using Oracle at the moment; I am using Hadoop Hive. I tried the query anyway, just in case, and I get an error saying that subqueries are not supported. On further investigation I found that Hive does not support subqueries outside of the FROM clause. I currently only have access to Hive or Pig.
SYS_CONNECT_BY_PATH is only available in Oracle (it is the function that makes all of this possible). I can check whether Hive has something similar, but I am no expert in it.
Thanks for the help. I got an error that I cannot post here because it looks funky; I have edited my original post, please take a look at my comment to you there.
Gaurav, I managed to get some of this working, but the "group as (type, value)" part of the third line is breaking it. When I omit "as (type, value)", I can get it to work with the third line changed to "C = foreach B generate group, COUNT(A);". That returns a table that looks like ((type, value), count). What am I doing wrong?
Sorry, my code had an error; it is fixed now. Specifically, it should be flatten(group), not group, on the third line. Hope this helps.
Gaurav, this is close, but still not what I need. Please take a look at my latest edit.
I have updated the code accordingly. Let me know if it helps.
Jerome, I tried it. It took me some time to get everything working, but I was eventually able to run the query. The result is close to what I want, but is there a way to have the values as columns, with each count underneath, instead of the vector as a single column?
Wow, great. To generate columns from the vector, use the "map_index" UDF. Of course, you need to know ahead of time what the columns are, since all rows of a Hive table have the same columns. (A sketch of this follows below.)
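
Regarding the last comment: a minimal sketch of pulling known values out of the map and into columns, assuming the final set of values (here x, y, z) is known when the table is defined. It uses Hive's built-in map subscript syntax; the map_index UDF mentioned above should work in the same way:

-- Minimal sketch: project known keys of the normalized map into columns.
-- Missing keys come back as NULL, hence the COALESCE to 0.0.
SELECT type,
       COALESCE(normalized_vector['x'], 0.0) AS x,
       COALESCE(normalized_vector['y'], 0.0) AS y,
       COALESCE(normalized_vector['z'], 0.0) AS z
FROM (
  SELECT type,
         vector_normalize(union_vector_sum(map(value, 1))) AS normalized_vector
  FROM table
  GROUP BY type
) v;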