Hadoop: reading files that match a pattern in Pig


I have a scenario where I use HCatStorer to load files matching 40 different patterns from one directory into Hive tables.

Directory : opt/inputfolder/ 
Input Files Pattern :

inp1*.log,
inp2*.log,
    .....
inp39*.log,
inp40*.log.
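I have written a Pig script that reads all the files matching the 40 patterns.

My problem is that these 40 files are not all mandatory, and I may not receive some of them. In that case I get an exception stating: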

Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException:
           Input Pattern opt/ip_files/inp16*.log matches 0 files
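Is there any way to handle this exception?

I want to read the remaining 39 file patterns even when that one file does not exist.

What if my source files are named with strings (i.e. banana_2014012.log, orange_2014012.log, apple_2014012.log)?

Below is how I load the data from these files into Hive tables using HCatStorer: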

*** Pseudo code ****
banana_src = LOAD 'banana_*.log' using PigStorage();
......
Store banana_src into 'BANANA' using HCatStorer();

apple_src = LOAD 'apple_*.log' using PigStorage();
......
Store apple_src into 'APPLE' using HCatStorer();

orange_src = LOAD 'orange_*.log' using PigStorage();
......
Store orange_src into 'ORANGE' using HCatStorer();
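If any source has no files, this Pig script throws an error saying the pattern matches 0 files, and the script ends in a failed state. Even when one source file is unavailable, I want my script to load the other tables without failing the job.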


Thanks.

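Hi Siva, thanks for the reply. I see your point. If the files were named as in the question, your approach would be correct. In my actual scenario the file names are strings, such as input_apple.log, input_orange.log, input_banana.log. Each file pattern is loaded into its own table (i.e. banana, apple, orange). It can happen that I do not receive the banana file for a day. How can I handle that in the Pig script?

Another option is to write a UDF for the load function and handle the exception inside the UDF. I have never tried a load-function UDF, but you could give it a try.

Thanks Siva. I had also decided to go with the approach you mentioned. Thanks again for the discussion.

Updated with a working solution that uses a shell script. I tested it locally and it works fine.
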
If you load inp1*.log, it also matches inp16*.log (if such files are present), so why are you loading inp16*.log again separately?

Based on the input above, I think a single pattern is sufficient for you:
        LOAD 'opt/ip_files/inp[1-9]*.log'

Please let me know if you are trying to do something different.
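
To see why a separate inp16*.log pattern is redundant, the overlap can be checked with ordinary shell globs, which share the `*` and `[1-9]` syntax with Hadoop path globs. This is only a local sketch: the temporary directory, the file names and the helper `count_matches` are made up for illustration, and it assumes paths without spaces.

```shell
#!/bin/bash
# count_matches PATTERN: print how many existing paths the glob matches.
count_matches() {
    # Unquoted $1 is expanded by the shell; an unmatched glob stays literal.
    set -- $1
    [ -e "$1" ] || { echo 0; return; }
    echo "$#"
}

# Hypothetical sample files in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/inp1_a.log" "$demo_dir/inp16_a.log" "$demo_dir/inp40_a.log"

count_matches "$demo_dir/inp1*.log"       # prints 2
count_matches "$demo_dir/inp[1-9]*.log"   # prints 3
```

Both inp1_a.log and inp16_a.log match the first pattern, and the bracket pattern covers every file whose number starts with a digit from 1 to 9, which includes inp10 through inp40.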

UPDATE:
I have one more option, though I am not sure whether it works for you.
1. Split your Pig script into three parts, say banana.pig, apple.pig and orange.pig; each script has its own logic.
2. Write a shell script that checks whether any files exist for each file pattern.
3. If the files are present, call the corresponding Pig script and pass the pattern with Pig's -param option; otherwise do not call it.
   With this option, if the files are not present, that particular Pig script is not triggered.

Shellscript: test.sh
#!/bin/bash

BANANA_FILES="opt/ip_files/banana_*.log"
APPLE_FILES="opt/ip_files/apple_*.log"
ORANGE_FILES="opt/ip_files/orange_*.log"

if ls $BANANA_FILES > /dev/null 2>&1
then
    echo "Banana File Found"
    pig -x local -param PIG_BANANA_INPUT_FILES="$BANANA_FILES" -f banana.pig
else
    echo "No Banana files found"
fi

if ls $APPLE_FILES > /dev/null 2>&1
then
    echo "Apple File Found"
    pig -x local -param PIG_APPLE_INPUT_FILES="$APPLE_FILES" -f apple.pig
else
    echo "No APPLE files found"
fi

if ls $ORANGE_FILES > /dev/null 2>&1
then
    echo "Orange File Found"
    pig -x local -param PIG_ORANGE_INPUT_FILES="$ORANGE_FILES" -f orange.pig
else
    echo "No Orange files found"
fi
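
The script above checks the local file system with `ls`, which fits `pig -x local`. If the input lives on HDFS, a similar check could be made with `hadoop fs -ls`, which as far as I know returns a non-zero exit status when the path or glob matches nothing; it is worth confirming that on your Hadoop version. This is a sketch under those assumptions, and `HADOOP_CMD` and `hdfs_has_files` are names introduced here for illustration.

```shell
#!/bin/bash
# Assumption: the hadoop CLI is on PATH (HADOOP_CMD is a hypothetical override).
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# hdfs_has_files PATTERN: succeed if the listing command exits with status 0.
# The pattern is quoted so that Hadoop, not the local shell, expands the glob.
hdfs_has_files() {
    $HADOOP_CMD fs -ls "$1" > /dev/null 2>&1
}

BANANA_FILES="opt/ip_files/banana_*.log"

if hdfs_has_files "$BANANA_FILES"
then
    echo "Banana File Found"
    # No "-x local" here, so Pig reads the pattern from the cluster file system.
    pig -param PIG_BANANA_INPUT_FILES="$BANANA_FILES" -f banana.pig
else
    echo "No Banana files found"
fi
```

The apple and orange blocks would change in the same way.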


PigScript: banana.pig
banana_src = LOAD '$PIG_BANANA_INPUT_FILES' using PigStorage;
DUMP banana_src;

PigScript: apple.pig
apple_src = LOAD '$PIG_APPLE_INPUT_FILES' using PigStorage;
DUMP apple_src;

PigScript: orange.pig
orange_src = LOAD '$PIG_ORANGE_INPUT_FILES' using PigStorage;
DUMP orange_src;

Output1: All the three files are present
$ ./test.sh 
Banana File Found
(1,2,3,4,5)
(a,b,c,d,e)
Apple File Found
(test1,test2)
Orange File Found
(13,4,5)

Output2: Only banana files are present
$ ./test.sh 
Banana File Found
(1,2,3,4,5)
(a,b,c,d,e)
No APPLE files found
No Orange files found
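
With 40 patterns, repeating the if/else block per source gets long. One possible way to fold it into a loop is sketched below. It assumes every source follows the same naming scheme as above (NAME_*.log, NAME.pig, and a parameter called PIG_<NAME>_INPUT_FILES), that paths contain no spaces, and that `INPUT_DIR`, `PIG_CMD` and `run_if_present` are names introduced here only for illustration.

```shell
#!/bin/bash
# Hypothetical knobs; adjust to the real layout.
INPUT_DIR="${INPUT_DIR:-opt/ip_files}"
PIG_CMD="${PIG_CMD:-pig -x local}"

# run_if_present NAME: run NAME.pig only if NAME_*.log matches a local file.
run_if_present() {
    name=$1
    pattern="$INPUT_DIR/${name}_*.log"
    # banana -> BANANA, to build the parameter name the Pig script expects
    upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')
    # Unquoted $pattern lets the shell expand the glob for ls
    if ls $pattern > /dev/null 2>&1
    then
        echo "$name: files found"
        $PIG_CMD -param "PIG_${upper}_INPUT_FILES=$pattern" -f "$name.pig"
    else
        echo "$name: no files found, skipping"
    fi
}

# One entry per source; extend the list to cover all the patterns.
for name in banana apple orange
do
    run_if_present "$name"
done
```

Because each source is handled by its own Pig invocation, a missing or failing source does not stop the remaining ones from being loaded.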