Hadoop: Combining tuples based on a field?

Tags: hadoop, apache-pig

Suppose I have a relation like:

{1001, {{id=1001, count=20, key=a}, {id=1001, count=30, key=b}}}
{1002, {{id=1002, count=40, key=a}, {id=1001, count=50, key=b}}}
I want to turn it into:

{id=1001, a=20, b=30}
{id=1002, a=40, b=50}

What Pig command can I use to do this?

I'm not sure about the format of the starting relation, but to me it looks like (int, bag:{tuple:(int,int,chararray)})? If so, this should work:

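-- Unnest the bag so each inner tuple becomes its own row, keyed by the outer id.
flattened = FOREACH x GENERATE $0 AS id, FLATTEN($1) AS (idx:int, count:int, key:chararray);
-- Split the rows by key, then join the two halves back together on id.
a = FILTER flattened BY key == 'a';
b = FILTER flattened BY key == 'b';
joined = JOIN a BY id, b BY id;
result = FOREACH joined GENERATE a::id AS id, a::count AS a, b::count AS b;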

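It looks like you are pivoting the data. You already have a bag of tuples, though. An inner join is expensive because it triggers an additional Map-Reduce job. To do this quickly, filter inside a nested foreach instead. The modified code would look something like this: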

inpt = load '..../pig/bag_pivot.txt' as (id : int, b:bag{tuple:(id : int, count : int, key : chararray)});

result = foreach inpt {
    col1 = filter b by key == 'a';
    col2 = filter b by key == 'b';
    generate id, flatten(col1.count) as a, flatten(col2.count) as b;
};
Sample input data:

1001    {(1001,20,a),(1001,30,b)}
1002    {(1002,40,a),(1001,50,b)}
Output:

(1001,20,30)
(1002,40,50)
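
For readers who want to check the logic without a Pig installation, here is a rough Python model of what the nested foreach does for each record. It is only a sketch of the semantics, not Pig itself: the function name pivot_record and the keys parameter are made up for illustration. It assumes that several FLATTEN calls in one GENERATE combine as a cross product, which also means an id whose bag has no tuple for one of the keys produces no output row at all.

```python
from itertools import product


def pivot_record(record_id, bag, keys=("a", "b")):
    """Model one pass of the nested foreach: filter the bag per key, then flatten.

    Hypothetical helper for illustration; it is not part of Pig.
    """
    # One "filter b by key == k" per output column, keeping only the count field.
    columns = [[count for (_id, count, key) in bag if key == k] for k in keys]
    # Several flattens in one generate behave like a cross product, so an
    # empty column yields no rows for this record.
    return [(record_id, *combo) for combo in product(*columns)]


# Same sample data as above: (id, bag of (id, count, key) tuples).
rows = [
    (1001, [(1001, 20, "a"), (1001, 30, "b")]),
    (1002, [(1002, 40, "a"), (1001, 50, "b")]),
]

result = [out for rid, bag in rows for out in pivot_record(rid, bag)]
print(result)
```

If some ids might be missing a key, it is worth testing the Pig script on such a record before relying on it, since under the assumption above that id would silently disappear from the result.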

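Can you give a schema for the structure you are trying to transform? I don't think you can nest a bag directly inside another bag unless the inner bag is wrapped in a tuple.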