Apache Pig: confusion over FLATTEN or SUBSTRING combined with JOIN
I have two datasets, data1 and data2. data2 contains the following data:
a1:u:11#eve:f:6
a1:u:12#eve:f:6
a1:u2:13#eve:f:3
a1:u1:12#eve:s:6
a1:u1:11#eve:f:6
Here both : and # are delimiters. I ultimately build data2 as follows:
data2 = LOAD '$data2' USING PigStorage(':') AS
(ad: chararray,
a_id: chararray,
cid_eve1: chararray,
name: chararray,
len: int);
Then I split the third column in two:
data2 = FOREACH data2 GENERATE
ad AS ad,
a_id AS a_id,
FLATTEN(STRSPLIT(cid_eve1, '#')) AS (cid: int, eve1: chararray),
name AS name,
len AS len;
Now, when I join data2 with data1, I get nothing back. I also tried:
data2 = FOREACH data2 GENERATE
ad AS ad,
a_id AS a_id,
SUBSTRING(cid_eve1,0,INDEXOF(cid_eve1,'#',0)) AS cid: int,
name AS name,
len AS len;
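This also returns nothing when joined. I am joining on the third column, cid. I even dumped data2 in both cases and checked the output, and it looks as expected. But when I use the following file as data2: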
a1:u:11:eve:f:6
a1:u:12:eve:f:6
a1:u2:13:eve:f:3
a1:u1:12:eve:s:6
a1:u1:11:eve:f:6
and load it as
data2 = LOAD '$data2' USING PigStorage(':') AS
(ad: chararray,
a_id: chararray,
cid: int,
eve1: chararray,
name: chararray,
len: int);
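then the join returns the correct result. I don't know why this happens. Can anyone help or offer a suggestion?

In data1, the second column ($1) is a_id and the last column is cid. Both are used in the join: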
1,u,true,true,4,1,1,1,1,1,11,21,31,11
1,u,true,true,4,1,1,1,1,1,11,21,32,11
1,u,true,true,4,1,1,1,1,1,11,21,33,11
1,u,true,true,4,1,1,1,1,1,11,21,31,11
1,u,true,true,4,1,1,1,1,1,11,21,32,11
1,u,true,true,4,1,1,1,1,1,11,21,33,11
2,u,true,true,4,1,1,1,1,1,12,22,34,12
2,u,true,true,4,1,1,1,1,1,13,22,35,13
2,u1,true,false,4,1,1,1,1,0,12,22,34,12
2,u1,true,false,4,1,1,1,1,0,13,22,35,13
2,u1,true,true,9,1,1,1,1,1,12,22,34,12
2,u1,true,true,9,1,1,1,1,1,13,22,35,13
3,u,false,false,4,1,0,1,0,0,14,24,31,14
3,u,false,false,4,1,0,1,0,0,11,22,31,11
4,u,true,NULL,0,1,1,0,0,0,11,22,33,11
4,u1,false,NULL,0,1,0,0,0,0,11,22,33,11
2,u,true,true,4,1,1,1,1,1,12,22,34,12
2,u,true,true,4,1,1,1,1,1,13,22,35,13
2,u2,true,true,7,1,1,1,1,1,12,22,34,12
2,u2,true,true,7,1,1,1,1,1,13,22,35,13
I found the answer. The problem was the data type: I was trying to read a chararray into an int without a type cast. When I changed it to
data2 = FOREACH data2 GENERATE
ad AS ad,
a_id AS a_id,
(int)SUBSTRING(cid_eve1,0,INDEXOF(cid_eve1,'#',0)) AS cid,
name AS name,
len AS len;
it worked.
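For reference, the whole flow with the explicit cast might look like the sketch below. It is only an illustration under assumptions: the aliases data1_raw, data2_raw and joined are made up here, the '$data1' parameter is assumed by analogy with '$data2', and data1 is loaded without a schema so that a_id and cid are picked by position ($1 and $13, the second and last of the 14 columns in the sample rows). DESCRIBE is included so the declared type of cid can be checked before joining.

```pig
-- Load data2 with ':' as the delimiter; the third field still holds "cid#eve1".
data2_raw = LOAD '$data2' USING PigStorage(':') AS
    (ad: chararray, a_id: chararray, cid_eve1: chararray, name: chararray, len: int);

-- Take the part before '#' and cast it explicitly to int,
-- instead of only declaring the type in the AS clause.
data2 = FOREACH data2_raw GENERATE
    ad AS ad,
    a_id AS a_id,
    (int)SUBSTRING(cid_eve1, 0, INDEXOF(cid_eve1, '#', 0)) AS cid,
    name AS name,
    len AS len;

-- Check that cid is reported as int.
DESCRIBE data2;

-- Assumption: data1 is comma-delimited and loaded without a schema,
-- so the join keys are projected by position and cast to matching types.
data1_raw = LOAD '$data1' USING PigStorage(',');
data1 = FOREACH data1_raw GENERATE
    (chararray)$1 AS a_id,
    (int)$13 AS cid;

-- Join on both keys; the key types now agree on both sides.
joined = JOIN data2 BY (a_id, cid), data1 BY (a_id, cid);
DUMP joined;
```

The point of the sketch is that both sides of the JOIN carry keys of the same type (chararray for a_id, int for cid). Consistent with the finding above, the cast is written on the expression itself rather than only as a type in the AS clause.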