Pig script: expand a single row with a start and end date into multiple rows, one per day

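I need a Pig script that converts a single row containing a campaign ID, start date, end date, and amount into multiple rows: one row per day, each carrying the amount allocated to that day. For example, the schema is:

Campaign ID, Start Date, End Date, Total Amount

My input row is:

1,2015-01-01,2015-01-10,10000

I need to create a separate row for each day of this campaign, dividing the total amount across the days, with the following schema:

Campaign ID, Date, Amount

1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000
... and so on, one row for each day of the campaign.

I was hoping I could use a nested FOREACH and the DaysBetween function.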

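Solving this in standard Pig is somewhat difficult; the challenge is generating the dates between two dates dynamically. If the range crosses a month boundary (for example, 2015-01-28 to 2015-02-06), Pig has no built-in way to work out that it must generate 4 days from January and 6 days from February.

One way to solve this is to move the date generation into a custom UDF that parses the input and generates the intermediate dates.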
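To make the date-generation step concrete, here is a minimal standalone sketch of expanding a date range into one date per day. It uses `java.time` rather than the Joda-Time classes in the UDF further down, and the class and method names (`DateRangeSketch`, `expand`) are hypothetical, chosen only for illustration.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

final class DateRangeSketch {
    // Returns every date from start to end (both inclusive) as ISO-8601 strings.
    static List<String> expand(String start, String end) {
        LocalDate first = LocalDate.parse(start);
        LocalDate last = LocalDate.parse(end);
        // between() excludes the end date, so add 1 to count both endpoints.
        long days = ChronoUnit.DAYS.between(first, last) + 1;
        List<String> dates = new ArrayList<>();
        for (long i = 0; i < days; i++) {
            // plusDays rolls over month boundaries on its own.
            dates.add(first.plusDays(i).toString());
        }
        return dates;
    }

    public static void main(String[] args) {
        // A range that crosses from January into February.
        for (String d : expand("2015-01-28", "2015-02-06")) {
            System.out.println(d);
        }
    }
}
```

Because the date library handles the calendar arithmetic, the caller never has to know how many days fall in each month.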

Example 1: one input row, with dates that do not cross a month boundary

Input:

1,2015-01-01,2015-01-10,10000
PigScript:

REGISTER PARSEDATE.jar;
A = LOAD 'input' USING PigStorage(',') AS (campaignId:int,startDate,endDate,totalAmount:int);
B = FOREACH A GENERATE campaignId,DaysBetween((datetime)endDate,(datetime)startDate)+1 AS cnt, totalAmount,TOBAG(*) AS mybag;
C = FOREACH B GENERATE campaignId,FLATTEN(TOKENIZE(mypackage.PARSEDATE(BagToString(mybag)),'#')),(int)(totalAmount/cnt) AS totalAmount;
STORE C INTO 'output' USING PigStorage(',');
Output:

1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000
1,2015-01-04,1000
1,2015-01-05,1000
1,2015-01-06,1000
1,2015-01-07,1000
1,2015-01-08,1000
1,2015-01-09,1000
1,2015-01-10,1000
Example 2: two input rows; the first does not cross a month boundary, the second does

Input:

1,2015-01-01,2015-01-10,10000
2,2015-01-28,2015-02-06,10000
PigScript:

REGISTER PARSEDATE.jar;
A = LOAD 'input1' USING PigStorage(',') AS (campaignId:int,startDate,endDate,totalAmount:int);
B = FOREACH A GENERATE campaignId,DaysBetween((datetime)endDate,(datetime)startDate)+1 AS cnt, totalAmount,TOBAG(*) AS mybag;
C = FOREACH B GENERATE campaignId,FLATTEN(TOKENIZE(mypackage.PARSEDATE(BagToString(mybag)),'#')),(int)(totalAmount/cnt) AS totalAmount;
STORE C INTO 'output1' USING PigStorage(',');
Output:

1,2015-01-01,1000
1,2015-01-02,1000
1,2015-01-03,1000
1,2015-01-04,1000
1,2015-01-05,1000
1,2015-01-06,1000
1,2015-01-07,1000
1,2015-01-08,1000
1,2015-01-09,1000
1,2015-01-10,1000
2,2015-01-28,1000
2,2015-01-29,1000
2,2015-01-30,1000
2,2015-01-31,1000
2,2015-02-01,1000
2,2015-02-02,1000
2,2015-02-03,1000
2,2015-02-04,1000
2,2015-02-05,1000
2,2015-02-06,1000
You need to compile the Java code below into a PARSEDATE.jar file and register it in your Pig script. I wrote this code quickly, so feel free to optimize it as needed.

PARSEDATE.java

package mypackage;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.joda.time.Days;
import org.joda.time.LocalDate;

public class PARSEDATE extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {

                // Return null for missing input instead of failing with an exception
                if (input == null || input.size() == 0 || input.get(0) == null) {
                        return null;
                }

                // The input is the BagToString output, with fields joined by '_'
                String[] fields = ((String) input.get(0)).split("_");

                // The start date is the second column, the end date the third
                LocalDate st = new LocalDate(fields[1]);
                LocalDate et = new LocalDate(fields[2]);

                // Number of days in the range, counting both the start and end dates
                int days = Days.daysBetween(st, et).getDays() + 1;

                // Append every date, each prefixed with '#' so the Pig script can split on it
                StringBuilder output = new StringBuilder();
                for (int index = 0; index < days; index++) {
                        output.append('#').append(st.plusDays(index));
                }
                return output.toString();
        }
}
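
The string hand-off between the Pig script and the UDF can be simulated in plain Java. This is only a sketch under two assumptions: that BagToString joins the row's fields with "_" (which the UDF's split("_") relies on), and that splitting on '#' skips empty tokens the way java.util.StringTokenizer does, so the leading '#' produces no empty date. The class and method names are hypothetical, and java.time stands in for Joda-Time.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class UdfRoundTripSketch {
    // Mimics the UDF: take "id_start_end_amount" and return "#date#date...".
    static String toHashDelimited(String underscoreJoinedRow) {
        String[] fields = underscoreJoinedRow.split("_");
        LocalDate start = LocalDate.parse(fields[1]);
        LocalDate end = LocalDate.parse(fields[2]);
        StringBuilder out = new StringBuilder();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            out.append('#').append(d);
        }
        return out.toString();
    }

    // Mimics the split on '#'; StringTokenizer never returns empty tokens.
    static List<String> splitOnHash(String s) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, "#");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String joined = toHashDelimited("1_2015-01-01_2015-01-03_10000");
        System.out.println(joined);
        System.out.println(splitOnHash(joined));
    }
}
```

Each token then becomes one output row once the Pig script flattens the result.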
Doesn't Pig's DaysBetween function handle the number of days between two dates? That is the function I was hoping would be useful: