Apache Pig: how to validate my Pig input data against the DML


How can I validate that my input data is correct according to the DML?

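Input data:
Jorge Posada|Yankees|{(Catcher,2000),(Designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell|Oakland|{(Catcher,2000),(First_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado |Atlanta|{(Second_baseman,2002),(Infielder,2003),(Left_fielder)}|[games#258,hit_by_pitch#3]

In the third record I left out the year field for (Left_fielder).

bfile = LOAD 'basketball1.txt' USING PigStorage('|') AS (name:chararray, team:chararray, position:bag{t:tuple(p:chararray, year:int)}, bat:map[]);
DUMP bfile;

(Jorge Posada,Yankees,{(Catcher,2000),(Designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7])
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7])
(Martin Prado,Atlanta,[games#258,hit_by_pitch#3])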

Regards,
Sanjeeb

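Here is a regex-based script that checks each line against the schema; it validates essentially every field. Please run it against your input and let me know if you need any additional validation.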

Regex:

'^
   ([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*
   ([A-Za-z]+)\\s*\\|\\s*
   (\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*
   (\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\])
 $'
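Pig's regex functions are built on java.util.regex, so the pattern can be tried outside Pig before running the full script. The following is a minimal sketch, assuming whole-line matching; the class name LineValidator and the method isValid are hypothetical names chosen for illustration and are not part of Pig.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): checks one raw input line
// against the same pattern as the regex above.
public class LineValidator {
    private static final Pattern LINE = Pattern.compile(
        "^([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*"            // name: two words
      + "([A-Za-z]+)\\s*\\|\\s*"                           // team: one word
      + "(\\{(?:\\([A-Za-z_]+,[0-9]+\\))"                  // bag: first (position,year) tuple
      + "(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*"     // bag: further tuples, comma separated
      + "(\\[(?:[A-Za-z_]+#[0-9\\.]+)"                     // map: first key#value pair
      + "(?:,[A-Za-z_]+#[0-9\\.]+)*\\])$");                // map: further pairs, comma separated

    // Returns true only when the whole line matches the pattern.
    public static boolean isValid(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String ok = "Landon Powell |Oakland|{(Catcher,2000)}|[on_base_percentage#0.297]";
        String bad = "Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003),(Left_fielder)}|[games#258,hit_by_pitch#3]";
        System.out.println(isValid(ok));
        System.out.println(isValid(bad));
    }
}
```

The backslash escaping is the same in a Java string literal as in the Pig string literal, which is why the pattern can be copied over piece by piece.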
input.txt:
I have marked each input line below as valid or invalid.

Jorge Posada |Yankees| {(Catcher,2000),(Designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7] --> Valid
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7] --> Valid
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003),(Left_fielder)}|[games#258,hit_by_pitch#3] --> Invalid: year is missing
Martin Prado |Atlanta| {(Second_baseman,2002)(Infielder,2003)}|[games#258,hit_by_pitch#3] --> Invalid: no comma between two tuples
Martin Prado |Atlanta| {,(Second_baseman,2002),(Infielder,2003)}|[games#258,hit_by_pitch#3] --> Invalid: comma at the start of the bag
Martin Prado |Atlanta| {(Second_baseman,2002),(,2003)}|[games#258,hit_by_pitch#3] --> Invalid: position is missing
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3] --> Invalid: delimiter | is missing
Martin Prado || {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3] --> Invalid: team name is missing
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#,hit_by_pitch#3] --> Invalid: value is missing for key games
Landon Powell |Oakland|{(Catcher,2000)}|[on_base_percentage#0.297] --> Valid
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001),(test,3000)}|[on_base_percentage#0.297,games#26,home_runs#7,test#1.2] --> Valid
PigScript:

-- Load each record as one raw line so the regex sees the whole row.
A = LOAD 'input.txt' AS (line:chararray);
-- Lines that do not match the regex come out as empty tuples.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*([A-Za-z]+)\\s*\\|\\s*(\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\])$')) AS (name:chararray,team:chararray,pos:bag{t:(p:chararray)},bat:map[]);
DUMP B;
Output: if an input line does not match the schema, it is printed as an empty tuple.

(Jorge Posada,Yankees,{(Catcher,2000),(Designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7]) --> Valid
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7]) --> Valid
() --> Invalid: year is missing
() --> Invalid: no comma between two tuples
() --> Invalid: comma at the start of the bag
() --> Invalid: position is missing
() --> Invalid: delimiter | is missing
() --> Invalid: team name is missing
() --> Invalid: value is missing for key games
(Landon Powell,Oakland,{(Catcher,2000)},[on_base_percentage#0.297]) --> Valid
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001),(test,3000)},[on_base_percentage#0.297,games#26,home_runs#7,test#1.2]) --> Valid
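An empty tuple says that a line is invalid but not which field caused it. One possible way to get a reason is to split the line on the delimiter first and check each field on its own. The sketch below is an illustration under that assumption, not part of the script above: FieldDiagnoser and diagnose are hypothetical names, and because each field is trimmed before matching, the rules are close to the single regex but not identical (leading whitespace before the name is tolerated here).

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): reports the first field that
// breaks the expected layout name|team|{(position,year),...}|[key#value,...].
public class FieldDiagnoser {
    private static final Pattern NAME = Pattern.compile("[A-Za-z]+\\s+[A-Za-z]+");
    private static final Pattern TEAM = Pattern.compile("[A-Za-z]+");
    private static final Pattern BAG = Pattern.compile(
        "\\{\\([A-Za-z_]+,[0-9]+\\)(?:,\\([A-Za-z_]+,[0-9]+\\))*\\}");
    private static final Pattern MAP = Pattern.compile(
        "\\[[A-Za-z_]+#[0-9.]+(?:,[A-Za-z_]+#[0-9.]+)*\\]");

    public static String diagnose(String line) {
        // Limit -1 keeps empty trailing fields so the count stays honest.
        String[] f = line.split("\\|", -1);
        if (f.length != 4) return "expected 4 fields, found " + f.length;
        if (!NAME.matcher(f[0].trim()).matches()) return "bad name";
        if (!TEAM.matcher(f[1].trim()).matches()) return "bad team";
        if (!BAG.matcher(f[2].trim()).matches()) return "bad positions bag";
        if (!MAP.matcher(f[3].trim()).matches()) return "bad bat map";
        return "valid";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(
            "Martin Prado |Atlanta| {(Second_baseman,2002),(,2003)}|[games#258,hit_by_pitch#3]"));
        System.out.println(diagnose(
            "Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3]"));
    }
}
```

Checking the field count first means a missing delimiter is reported as such instead of surfacing as a confusing failure in a later field.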

Could you add more samples, both valid and invalid, to validate the input?