Hadoop equivalent of Linux 'diff' in Apache Pig

I want to be able to perform a standard diff on two large files. I have something that works, but it is nowhere near as fast as diff on the command line:
A = load 'A' as (line);
B = load 'B' as (line);
JOINED = join A by line full outer, B by line;
DIFF = FILTER JOINED by A::line is null or B::line is null;
DIFF2 = FOREACH DIFF GENERATE (A::line is null ? B::line : A::line), (A::line is null ? 'REMOVED' : 'ADDED');
STORE DIFF2 into 'diff';
Is there a better way to do this?

I use the following approaches. (My JOIN approach is very similar to yours, but this method does not replicate diff's behavior with duplicated lines.) As this was asked some time ago, perhaps you are running with only one reducer, since Pig gained the ability to adjust the number of reducers in 0.8?
- The two approaches I use are within a few percent of each other in performance, but they do not treat duplicates the same way
- The JOIN approach collapses duplicates (so if one file has more duplicates than the other, this approach will not output the extra duplicates)
- The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file
- Unlike the Unix diff(1) tool, order is not important (effectively the JOIN approach performs sort -u | diff, while UNION performs sort | diff)
- If you have an incredible number (~thousands) of duplicate lines, things will slow down because of the joins (if your usage allows, perform a DISTINCT on the raw data first)
- If your lines are very long (e.g. >1KB in size), it would be advisable to use a UDF to hash the lines, diff only the hashes, and then JOIN with the original files to recover the original lines before output
- Using LZO-compressed input on 18 nodes, a diff over 200GB (1,055,687,930 lines) takes about 10 minutes
- Each approach takes only one Map/Reduce cycle
- This works out to roughly 1.8GB diffed per node per minute (not a great throughput, but on my system it seems diff(1) operates only in memory, while Hadoop leverages streaming disks)
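The duplicate-handling difference between the two approaches can be reproduced with the Unix tools themselves; a minimal sketch (file names are illustrative, and process substitution assumes bash):

```shell
# Two toy inputs: a.txt carries the line "dup" twice, b.txt once.
printf 'dup\ndup\nonly_a\n' > a.txt
printf 'dup\nonly_b\n' > b.txt

# UNION-style semantics (sort | diff): the extra "dup" copy survives,
# so two lines are reported as present only in a.txt.
sort a.txt | diff - <(sort b.txt) | grep -c '^<'      # → 2

# JOIN-style semantics (sort -u | diff): duplicates collapse first,
# so only "only_a" is reported.
sort -u a.txt | diff - <(sort -u b.txt) | grep -c '^<'   # → 1
```
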
SET job.name 'Diff(1) Via Join';
-- Erase Outputs
rmf first_only
rmf second_only
-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS First: chararray;
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Second: chararray;
-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;
-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
SET job.name 'Diff(1) Via Union';
-- Erase Outputs
rmf first_only
rmf second_only
-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;
-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;
-- Find Unique Lines
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'
counts = FOREACH c_group {
firsts = FILTER combined BY File == 1;
seconds = FILTER combined BY File == 2;
GENERATE
FLATTEN(
(COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
(COUNT(firsts) - COUNT(seconds) > 0 ?
TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
)
) AS (Row, File); };
-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
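For the long-line case mentioned above, a hash-first variant of the JOIN script might look like the sketch below. It is an assumption-laden sketch, not a tested script: it presumes a hash UDF such as DataFu's datafu.pig.hash.MD5 is registered (the jar path is illustrative; any line-hash UDF would do), and only the a-side output is shown.

-- Sketch only: diff very long lines by hashing first, then JOIN back
-- to recover the original lines. Jar path and UDF choice are assumptions.
REGISTER datafu.jar;
DEFINE MD5 datafu.pig.hash.MD5();
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
-- Join on short hashes instead of the full (>1KB) lines
a_h = FOREACH a_raw GENERATE MD5(Row) AS Hash;
b_h = FOREACH b_raw GENERATE MD5(Row) AS Hash;
combined = JOIN a_h BY Hash FULL OUTER, b_h BY Hash;
first_hashes = FOREACH (FILTER combined BY b_h::Hash IS NULL) GENERATE a_h::Hash AS Hash;
-- Recover the original lines before output
a_keyed = FOREACH a_raw GENERATE MD5(Row) AS Hash, Row;
first_only = FOREACH (JOIN first_hashes BY Hash, a_keyed BY Hash) GENERATE a_keyed::Row;
STORE first_only INTO 'first_only' USING PigStorage();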