Hadoop: performing an intersection of 2 sorted bags in Pig

hadoop apache-pig

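I have loaded 2 sorted bags from HDFS. Now I want to perform a merge join or a set intersection on them, so that (3,Orphans of the Storm) and (7,Muriel's Wedding) are returned as the result.

I ran into problems using DataFu and Pig's merge join feature.

I tried the naive solution below, but it does not take advantage of the fact that my data is sorted.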

vegas = LOAD 'vegas' USING PigStorage() AS (B1:bag{T1:tuple(id:int, name:chararray)});
macau = LOAD 'macau' USING PigStorage() AS (B2:bag{T2:tuple(id:int, name:chararray)});
vegast = FOREACH vegas GENERATE FLATTEN(B1) AS (id:int,name:chararray);
macaut = FOREACH macau GENERATE FLATTEN(B2) AS (id:int,name:chararray);

F = JOIN vegast BY id, macaut BY id;
-- desired output: (3,Orphans of the Storm), (7,Muriel's Wedding)

-- DESCRIBE vegas;
-- vegas: {B1: {T1: (id: int,name: chararray)}}
-- data for vegas:
-- ({(3,Orphans of the Storm),(6,One Magic Christmas),(7,Muriel's Wedding),(8,Mother's Boys),(9,Nosferatu: Original Version)})

-- DESCRIBE macau;
-- macau: {B2: {T2: (id: int,name: chararray)}}
-- data for macau:
-- ({(1,The Nightmare Before Christmas),(3,Orphans of the Storm),(4,The Object of Beauty),(7,Muriel's Wedding)})

Can anyone suggest the best way to find the intersection of two sorted bags using Pig?

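We (a Hadoop platform-as-a-service provider) had the same problem with set operations on bags. We decided to take the simple route and implemented the set operations in a JRuby UDF.

To run it, JRuby must be installed on the nodes.

See here for the code: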

If the relations are sorted on the join field, you can merge join them:

F = join vegast by id, macaut by id USING 'merge';
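As a rough illustration of why sorted input matters for a merge join, the following Python sketch walks two id-sorted lists in a single pass. It is not Pig's implementation; the function name `merge_join` is hypothetical, and the sketch assumes each id appears at most once per side, as in the question's sample data.

```python
# Conceptual sketch of a sort-merge join over two inputs sorted by id.
# Assumption: ids are unique within each input (true for the sample data).

def merge_join(left, right):
    """Join two lists of (id, name) tuples that are already sorted by id."""
    i, j = 0, 0
    joined = []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1                      # left id is smaller: advance left
        elif left[i][0] > right[j][0]:
            j += 1                      # right id is smaller: advance right
        else:
            joined.append(left[i] + right[j])  # ids match: emit joined row
            i += 1
            j += 1
    return joined


vegas = [
    (3, "Orphans of the Storm"),
    (6, "One Magic Christmas"),
    (7, "Muriel's Wedding"),
    (8, "Mother's Boys"),
    (9, "Nosferatu: Original Version"),
]
macau = [
    (1, "The Nightmare Before Christmas"),
    (3, "Orphans of the Storm"),
    (4, "The Object of Beauty"),
    (7, "Muriel's Wedding"),
]

for row in merge_join(vegas, macau):
    print(row)
# (3, 'Orphans of the Storm', 3, 'Orphans of the Storm')
# (7, "Muriel's Wedding", 7, "Muriel's Wedding")
```

Each input is read once and nothing is buffered or hashed, which is the advantage the question is after. Like a join in Pig, the result carries the columns of both sides.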

See the Pig documentation for more:

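I solved this with a set operation after a COGROUP. If anyone has gotten DataFu's SetIntersect or Pig's merge join to work for this, please share a hint.

This works if you load the following structure directly, or arrive at it after performing a join (note that the bags must be sorted):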

-- assumes the DataFu jar has already been registered
DESCRIBE relationWith2Bags;
-- relationWith2Bags: {B1: {(id: int,name: chararray)},B2: {(id: int,name: chararray)}}
-- Let it contain a single tuple holding the sorted bags from the question:
-- B1: {(3,Orphans of the Storm),(6,One Magic Christmas),(7,Muriel's Wedding),(8,Mother's Boys),(9,Nosferatu: Original Version)}
-- B2: {(1,The Nightmare Before Christmas),(3,Orphans of the Storm),(4,The Object of Beauty),(7,Muriel's Wedding)}

intersect = FOREACH relationWith2Bags GENERATE datafu.pig.sets.SetIntersect(B1, B2);
DUMP intersect;
--({(3,Orphans of the Storm),(7,Muriel's Wedding)})
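The answer stresses that the bags must be sorted. One plausible reason is that a sorted set intersection can be computed in a single forward pass, which only works if both inputs are in the same order. The Python sketch below illustrates that idea on whole tuples. It is not DataFu's actual code, and `intersect_sorted` is a hypothetical name.

```python
# Conceptual sketch of intersecting two sorted bags of (id, name) tuples.
# Tuples are compared as a whole (id first, then name).

def intersect_sorted(b1, b2):
    """Return the tuples present in both bags; both must be sorted ascending."""
    i, j = 0, 0
    common = []
    while i < len(b1) and j < len(b2):
        if b1[i] == b2[j]:
            common.append(b1[i])        # same tuple in both bags
            i += 1
            j += 1
        elif b1[i] < b2[j]:
            i += 1                      # b1 is behind: advance it
        else:
            j += 1                      # b2 is behind: advance it
    return common


B1 = [
    (3, "Orphans of the Storm"),
    (6, "One Magic Christmas"),
    (7, "Muriel's Wedding"),
    (8, "Mother's Boys"),
    (9, "Nosferatu: Original Version"),
]
B2 = [
    (1, "The Nightmare Before Christmas"),
    (3, "Orphans of the Storm"),
    (4, "The Object of Beauty"),
    (7, "Muriel's Wedding"),
]

print(intersect_sorted(B1, B2))
# [(3, 'Orphans of the Storm'), (7, "Muriel's Wedding")]

# With an unsorted first bag the single pass silently misses a match:
unsorted_b1 = [(7, "Muriel's Wedding"), (3, "Orphans of the Storm")]
print(intersect_sorted(unsorted_b1, B2[1:2] + B2[3:]))
# [(7, "Muriel's Wedding")]
```

The second call shows why the sort requirement matters for this kind of algorithm: once the pointer in the sorted bag has moved past id 3, the late-arriving (3, ...) tuple in the unsorted bag can no longer be matched.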