Pig - Join doesn't work


I have a problem with a join in Pig. Let me give you the context first. Here is my code:

-- START file loading
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (PARTRANGE:chararray, COD_IPUSER:chararray);

-- trim
A = FOREACH start_file GENERATE TRIM(PARTRANGE) AS PARTRANGE, TRIM(COD_IPUSER) AS COD_IPUSER;

dump A;
It gives this output:

(79.92.147.88,20140310)
(79.92.147.88,20140310)
(109.31.67.3,20140310)
(109.31.67.3,20140310)
(109.7.229.143,20140310)
(109.8.114.133,20140310)
(77.198.79.99,20140310)
(77.200.174.171,20140310)
(77.200.174.171,20140310)
(109.17.117.212,20140310)
Loading the other file:

-- Load the Hadopi search file
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (IP_RECHERCHEE:chararray, DATE_HADO:chararray);

dump file2;
The output is:

(2014/03/10 00:00:00,79.92.147.88)
(2014/03/10 00:00:01,79.92.147.88)
(2014/03/10 00:00:00,192.168.2.67)
(2014/03/10 00:00:00,79.92.147.88,,)
(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:01,79.92.147.88,,)
Now I want to do a left outer join. Here is the code:

result = JOIN file2 by IP_RECHERCHEE LEFT OUTER, A by COD_IPUSER;
dump result;
The output is:

(2014/03/10 00:00:00,79.92.147.88)
(2014/03/10 00:00:01,79.92.147.88)
(2014/03/10 00:00:00,192.168.2.67)
(2014/03/10 00:00:00,79.92.147.88,,)
(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:01,79.92.147.88,,)
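All the records of "file2" are there, which is fine, but none of the records from start_file are. It is as if the join failed.

Do you know where the problem is?

Thanks.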

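You have mislabeled the fields in file2. You call the first field the IP and the second field the date, when, as your dump shows, it is the other way around. Try a FOREACH file2 GENERATE IP_RECHERCHEE and you will see the field you are actually trying to join on.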
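
The check suggested above could look like the following minimal sketch. It reuses the LOAD statement from the question; the alias join_keys is a hypothetical name introduced here for illustration.

```pig
-- Same LOAD as in the question, with the schema as the asker declared it
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (IP_RECHERCHEE:chararray, DATE_HADO:chararray);

-- Print the declared schema (names and types) of the relation
DESCRIBE file2;

-- Project only the column used as the join key and inspect its values
-- (join_keys is a hypothetical alias)
join_keys = FOREACH file2 GENERATE IP_RECHERCHEE;
DUMP join_keys;
```

Given the dump of file2 shown in the question, this would be expected to print the timestamps rather than the IP addresses, which is what shows that the field names are swapped.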


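The result is as expected. You are calling a left outer join, which looks for matches between the IP_RECHERCHEE field of file2 and the COD_IPUSER field of A.
Since there is no match, it returns every row of file2 and fills the fields from A with null.

Clearly
2014/03/10 00:00:00 != 20140310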
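
One way to confirm that a left outer join found no partner for a row is to filter on a null field from the right-hand relation. The sketch below repeats the question's script so that it stands alone; the alias unmatched is a hypothetical name, and the sketch assumes that COD_IPUSER is never null in A itself.

```pig
-- Script from the question, unchanged
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (PARTRANGE:chararray, COD_IPUSER:chararray);
A = FOREACH start_file GENERATE TRIM(PARTRANGE) AS PARTRANGE, TRIM(COD_IPUSER) AS COD_IPUSER;
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (IP_RECHERCHEE:chararray, DATE_HADO:chararray);
result = JOIN file2 by IP_RECHERCHEE LEFT OUTER, A by COD_IPUSER;

-- After a join, fields are disambiguated with the relation name and '::'.
-- Rows of file2 with no match in A carry null in every field coming from A.
-- (unmatched is a hypothetical alias)
unmatched = FILTER result BY A::COD_IPUSER IS NULL;
DUMP unmatched;
```

If every row of file2 shows up in unmatched, the two join keys never compare equal, which points at the key columns rather than at the JOIN statement.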


Your field names are wrong, and you are joining on the wrong field. It looks like you want to join on the IP address:

start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (IP:chararray, PARTRANGE:chararray);

A = FOREACH start_file GENERATE TRIM(IP) AS IP, TRIM(PARTRANGE) AS PARTRANGE;

file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (DATE_HADO:chararray, IP:chararray);
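
The snippet above stops before the join itself. Presumably the statement was along the following lines; this is an assumption, with the alias result carried over from the question and both relations joined on their IP field.

```pig
-- Assumed join statement (not shown in the answer): left outer join on IP
result = JOIN file2 BY IP LEFT OUTER, A BY IP;
DUMP result;
```
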
This is what I get:

(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)
