Error executing FOREACH in Apache Pig

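I have three logs: a Squid log, a login log, and a logout log. I need to use them to find out which sites each user visited. I am using Apache Pig and wrote the following script: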

copyFromLocal /home/marcelo/Documentos/hadoop/squid.txt /tmp/squid.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_in /tmp/login.txt;
copyFromLocal /home/marcelo/Documentos/hadoop/samba.log_out /tmp/logout.txt;

squid = LOAD '/tmp/squid.txt' USING PigStorage AS (linha: chararray);
nsquid = FOREACH squid GENERATE FLATTEN (STRSPLIT(linha,'[ ]+'));
nsquid = FOREACH nsquid GENERATE $0 AS timeStamp:chararray, $2 AS ipCliente:chararray, $5 AS request:chararray, $6 AS url:chararray;
nsquid = FOREACH nsquid GENERATE FLATTEN (STRSPLIT(timeStamp,'[.]'))AS (timeStamp:int,resto:chararray),ipCliente,request,url;
nsquid = FOREACH nsquid GENERATE (int)$0 AS timeStamp:int, $2 AS ipCliente:chararray,$3 AS request:chararray, $4 AS url:chararray;    
connect = FILTER nsquid BY (request=='CONNECT');


login = LOAD '/tmp/login.txt' USING PigStorage(' ') AS  (serverAL: chararray, data: chararray, hora: chararray,  netlogon: chararray, on: chararray, ip: chararray);
nlogin = FOREACH login GENERATE FLATTEN(STRSPLIT(serverAL,'[\\\\]')),data, hora,FLATTEN(STRSPLIT(ip,'[\\\\]'));
nlogin = FOREACH nlogin GENERATE $1 AS al:chararray, $2 AS data:chararray, $3 AS hora:chararray, $4 AS ipCliente:chararray;

logout = LOAD '/tmp/logout.txt' USING PigStorage(' ') AS  (data: chararray, hora: chararray, logout: chararray,  ipAl: chararray, disconec: chararray);
nlogout = FOREACH logout GENERATE data, hora, FLATTEN(STRSPLIT(ipAl,'[\\\\]')); 
nlogout = FOREACH nlogout GENERATE $0 AS data:chararray,$1 AS hora:chararray,$2 AS ipCliente:chararray, $3 AS al:chararray; 

data = JOIN nlogin BY (al,ipCliente,data), nlogout BY (al,ipCliente,data);
ndata = FOREACH data GENERATE nlogin::al,ToUnixTime(ToDate(CONCAT(nlogin::data, nlogin::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogin:int,ToUnixTime(ToDate(CONCAT(nlogout::data, nlogout::hora),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS tslogout:int,nlogout::ipCliente;
BB = FOREACH ndata GENERATE $0 AS al:chararray, (int)$1 AS tslogin:int, (int)$2 AS tslogout:int, $3 AS ipCliente:chararray;
CC = JOIN BB BY ipCliente, connect BY ipCliente; 
DD = FOREACH CC GENERATE BB::al AS al:chararray, (int)BB::tslogin AS tslogin:int, (int)BB::tslogout AS tslogout:int,(int)connect::timeStamp AS timeStamp:int, connect::ipCliente AS ipCliente:chararray, connect::url AS url:chararray;
EE = FILTER DD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout); 
STORE EE INTO 'EEs';

I found a workaround that works: replace the second-to-last line with:

STORE DD INTO 'DD';
newDD = LOAD 'hdfs://localhost:9000/user/root/DD' USING PigStorage AS (al:chararray, tslogin:int, tslogout:int, timeStamp:int, ipCliente:chararray, url:chararray);
EE = FILTER newDD BY (tslogin<=timeStamp) AND (timeStamp<=tslogout); 

Does anyone know how to fix this without the intermediate STORE?

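I think the problem is with the squid.txt file. Could you check that file? The log clearly shows that reading data from /tmp/squid.txt failed.

I believe the file is fine: when I DUMP connect, it produces the correct results.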