Handling duplicate records in Pig Latin

Tags: hadoop, apache-pig

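If a file contains duplicates (records sharing the same Acc), the first record should go to the valid file, and the remaining duplicate records should be moved to the invalid file, using a Pig script.

Here is the scenario: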

Input:
Acc|Phone|Name
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
1234|234-123-0000|DEF
9999|123-456-1890|PQR
8734|456-879-1234|QWE
4567|369-258-0147|NNN
1234|987-654-3210|BLS

output: Two files

1. Valid rec:
1234|333-444-5555|XYZ
4567|222-555-1111|ABC
9999|123-456-1890|PQR
8734|456-879-1234|QWE

2. Invalid rec:
1234|234-123-0000|DEF
4567|369-258-0147|NNN
1234|987-654-3210|BLS

The order of the invalid records does not have to be the same. It could also look like this:

Invalid rec:
1234|234-123-0000|DEF
1234|987-654-3210|BLS
4567|369-258-0147|NNN
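
For reference, the expected split can be written down outside Pig. The sketch below is plain Python, not the Pig solution asked for; the function name `split_first_seen` is hypothetical, and it assumes that the first field (Acc) is the duplicate key and that "first" means first in input order.

```python
def split_first_seen(lines, sep="|"):
    """Split records into (valid, invalid).

    The first record seen for each key (the first field, Acc) is valid;
    every later record with the same key is invalid.
    """
    seen = set()
    valid, invalid = [], []
    for line in lines:
        key = line.split(sep, 1)[0]
        if key in seen:
            invalid.append(line)
        else:
            seen.add(key)
            valid.append(line)
    return valid, invalid


records = [
    "1234|333-444-5555|XYZ",
    "4567|222-555-1111|ABC",
    "1234|234-123-0000|DEF",
    "9999|123-456-1890|PQR",
    "8734|456-879-1234|QWE",
    "4567|369-258-0147|NNN",
    "1234|987-654-3210|BLS",
]

valid, invalid = split_first_seen(records)
```

On the sample input, `valid` and `invalid` match the two listings above. Unlike a distributed Pig job, this version relies on reading the file sequentially, which is why a Pig script needs an explicit row number to define what "first" means.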

Scenario 2: Input:

Good records:

1234|333-444-5555|XYZ
4567|222-555-1111|ABC
9999|123-456-1890|PQR
8734|456-879-1234|QWE

Can anyone offer some suggestions? I can only get the first record.

Thanks

Can you try this?

input.txt

1234|333-444-5555|XYZ
4567|222-555-1111|ABC
1234|234-123-0000|DEF
9999|123-456-1890|PQR
8734|456-879-1234|QWE
4567|369-258-0147|NNN
1234|987-654-3210|BLS

PigScript:

A = LOAD 'input.txt' USING PigStorage('|') AS (Acc:chararray, Phone:chararray, Name:chararray);

-- RANK with no field list prepends a sequential row number as rank_A
B = RANK A;
C = GROUP B BY Acc;
D = FOREACH C {
        sortInAsc = ORDER B BY rank_A ASC;
        top1 = LIMIT sortInAsc 1;
        GENERATE top1 AS goodRecord, SUBTRACT(B, top1) AS badRecord;
    }

-- Flatten the good records
E = FOREACH D GENERATE FLATTEN(goodRecord);

-- Keep the original columns and skip the rank column (i.e., $0)
F = FOREACH E GENERATE $1, $2, $3;
STORE F INTO 'goodrecord' USING PigStorage('|');

-- Flatten the bad records
G = FOREACH D GENERATE FLATTEN(badRecord);

-- Keep the original columns and skip the rank column (i.e., $0)
H = FOREACH G GENERATE $1, $2, $3;
STORE H INTO 'badrecord' USING PigStorage('|');

goodrecord output 1:

1234|333-444-5555|XYZ
4567|222-555-1111|ABC
8734|456-879-1234|QWE
9999|123-456-1890|PQR

badrecord output 1:

1234|987-654-3210|BLS
1234|234-123-0000|DEF
4567|369-258-0147|NNN

Scenario 2 goodrecord output:

1234|333-444-5555|XYZ
4567|222-555-1111|ABC
8734|456-879-1234|QWE
9999|123-456-1890|PQR

Scenario 2 badrecord output:

1234|033-444-5555|XYZ
1234|007-654-3210|BLS
1234|230-123-0000|DEF
1234|303-444-5555|XYZ
1234|234-123-0000|DEF
1234|134-123-0000|DEF
1234|086-654-3210|BLS
1234|087-654-3210|BLS
4567|369-258-0147|NNN
4567|309-258-0147|NNN
4567|122-555-1111|ABC
4567|069-258-0147|NNN
4567|200-555-1111|ABC
8734|456-879-1234|QWE
8734|456-779-1234|QWE
9999|123-456-1890|PQR
9999|023-456-1890|PQR

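Comments:

It worked. Thanks a lot. SUBTRACT is not available in my Pig version. It looks like it is present in versions 0.12 and above. I downloaded the Subtract UDF and added it to my jar.

If there are more records, this code does not return the first record as the good record.

Naveen, can you paste your input so that I can take a look?

Hi Sivasakthi, I have added the input to my question as Scenario 2. Please check. I get the good records output as:
1234|007-654-3210|BLS
4567|069-258-0147|NNN
8734|456-779-1234|QWE
9999|023-456-1890|PQR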