
Hadoop: merging two files column-wise using Pig

I want to merge/combine two files using Pig. However, this is a union different from an ordinary UNION. Here are my files (h* are the file headers):

F1 :
h1,h2,h3,h4
a01,a02,a03,a04
a11,a12,a13,a14

F2 :
h3,h4,h5,h6
a23,a24,b01,b02
a33,a34,b11,b12

The resulting output must be the union of these files, as follows:

FR :
h1,h2,h3,h4,h5,h6 
a01,a02,a03,a04,,
a11,a12,a13,a14,,
,,a23,a24,b01,b02
,,a33,a34,b11,b12
Another difficulty is that I want to make this generic so that it can handle a dynamic number of common columns. Currently there are two common columns; there could be 3 or 1, or even no common columns at all. For example:

F1 :
h1,h2,h3,h4
a1,a2,a3,a4

F2
h5,h6,h7,h8
b1,b2,b3,b4

FR
a1,a2,a3,a4
,,,,b1,b2,b3,b4

Any hints/help would be appreciated.

Here is how to do it statically:

-- pad each relation with nulls for the columns it does not have, so the schemas line up
F1full = FOREACH F1 GENERATE h1, h2, h3, h4, NULL AS h5, NULL AS h6;
F2full = FOREACH F2 GENERATE NULL AS h1, NULL AS h2, h3, h4, h5, h6;

FR = UNION F1full, F2full;
Pig is not very flexible, so I don't think this can be generated dynamically / for the general case.


If you want a solution for the general case, you can use a language such as Python to build the required commands from the metadata of the stored tables/files.
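Building on that suggestion, here is a minimal Python sketch of the idea: given the two header lists (however your metadata provides them), it prints the padded FOREACH statements and the final UNION. The function name generate_pig_union, the relation names, and the header values are illustrative assumptions, not code from this answer.

# Sketch: emit the static Pig statements above from two header lists.
def generate_pig_union(headers1, headers2, rel1="F1", rel2="F2", out="FR"):
    # union of the two header lists, preserving order
    all_cols = headers1 + [h for h in headers2 if h not in headers1]
    def padded(rel, own):
        # keep a column if the file has it, otherwise pad with NULL
        exprs = [h if h in own else "NULL AS %s" % h for h in all_cols]
        return "%sfull = FOREACH %s GENERATE %s;" % (rel, rel, ", ".join(exprs))
    return "\n".join([
        padded(rel1, headers1),
        padded(rel2, headers2),
        "%s = UNION %sfull, %sfull;" % (out, rel1, rel2),
    ])

# example with the headers from the question
print(generate_pig_union(["h1", "h2", "h3", "h4"], ["h3", "h4", "h5", "h6"]))

For the example headers this prints exactly the three statements shown above, and it handles any number of common columns, including none.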

I tried to solve this using the following approach:

1) Load both of the files.
2) Add a counter to generate a unique field (ID).
3) Start the counter for file B where the counter for A ended.
4) Cogroup both files on the common columns, including the counter.
5) Project all the group columns into a separate relation.
6) Generate the uncommon columns from both files, along with the counter.
7) First, join the uncommon columns from file A with the group columns on the counter.
8) Join the result of step 7 with the uncommon columns from file B on the counter.
Below is the Pig script that does this. Since the script is generic, I have listed the parameters that must be supplied before running it.

-- Parameters required: $file1_path, $file2_path, $file1_schema, $file2_schema,
-- $COUNT_A (number of rows in file A), $CMN_COLUMN_A (common columns in A, including the rank field),
-- $CMN_COLUMN_B, $UNCMN_COLUMN_A (columns unique to file A), $UNCMN_COLUMN_B.
A = LOAD '$file1_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') AS ($file1_schema);
B = LOAD '$file2_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') AS ($file2_schema);

-- add a row counter to each file; offset B's counter so it starts where A's ends
RANK_A = RANK A;
RANK_B = RANK B;
COUNT_RANK_B = FOREACH RANK_B GENERATE ($0 + (long)'$COUNT_A') AS rank_B, $1 ..;

-- cogroup on the common columns (which include the counter)
COGRP_RANK_AB = COGROUP RANK_A BY ($CMN_COLUMN_A), COUNT_RANK_B BY ($CMN_COLUMN_B);

CMN_COGRP_RANK_AB = FOREACH COGRP_RANK_AB GENERATE FLATTEN(group) AS ($CMN_COLUMN_A);
UNCMN_RA = FOREACH RANK_A GENERATE $UNCMN_COLUMN_A;  -- assumed: this line is missing from the original script (step 6 above)
UNCMN_RB = FOREACH COUNT_RANK_B GENERATE $UNCMN_COLUMN_B;

-- left-join the uncommon columns of A, then of B, back onto the group columns
JOIN_CMN_UNCMN_A = JOIN CMN_COGRP_RANK_AB BY (rank_A) LEFT OUTER, UNCMN_RA BY rank_A;
JOIN_CMN_UNCMN_B = JOIN JOIN_CMN_UNCMN_A BY (CMN_COGRP_RANK_AB::rank_A) LEFT OUTER, UNCMN_RB BY rank_B;

-- assumed: FINAL_DATA is not defined in the original; project the merged columns (dropping the rank fields) before storing
FINAL_DATA = FOREACH JOIN_CMN_UNCMN_B GENERATE $0 ..;

STORE FINAL_DATA INTO '$store_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
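Since the script takes all of its inputs as parameters, a wrapper can compute them and launch Pig. Here is a hypothetical Python sketch (not part of the answer): it derives the parameters from the two header rows and runs the script with pig -param / -f. The paths, the row count, and the script name pig_union.pig are placeholders.

import subprocess

# Hypothetical wrapper: derive the parameters from the two file headers and
# launch the (assumed) pig_union.pig script above via `pig -param ... -f ...`.
headers_a = ["h1", "h2", "h3", "h4"]   # header row of file A
headers_b = ["h3", "h4", "h5", "h6"]   # header row of file B
common = [h for h in headers_a if h in headers_b]

params = {
    "file1_path": "/data/F1.csv",      # hypothetical paths
    "file2_path": "/data/F2.csv",
    "store_path": "/data/FR",
    "file1_schema": ",".join(headers_a),
    "file2_schema": ",".join(headers_b),
    "COUNT_A": "2",                    # number of rows in file A
    # rank_A / rank_B are the counter fields added by RANK inside the script
    "CMN_COLUMN_A": ",".join(["rank_A"] + common),
    "CMN_COLUMN_B": ",".join(["rank_B"] + common),
    "UNCMN_COLUMN_A": ",".join(["rank_A"] + [h for h in headers_a if h not in common]),
    "UNCMN_COLUMN_B": ",".join(["rank_B"] + [h for h in headers_b if h not in common]),
}

cmd = ["pig"]
for name, value in params.items():
    cmd += ["-param", "%s=%s" % (name, value)]
cmd += ["-f", "pig_union.pig"]
subprocess.check_call(cmd)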

Does this require manually specifying the unique and common columns for each table, or does it provide a programmatic way to supply the input variables? I have a wrapper around the script that can find all the variables programmatically. I am using the same script to union n files, so I cannot provide everything manually.