Apache pig 在Pig中读取多个Orc文件

Apache pig 在Pig中读取多个Orc文件,apache-pig,glob,orc,Apache Pig,Glob,Orc,我正在尝试使用pig的OrcStorage()读取/加载目录中存在的多个Orc文件。我尝试使用glob技术,但这对我不起作用,并抛出错误,说文件不存在,因为它是可用的。请让我知道如何在pig中实现此功能 使用的示例文件: hadoop fs -ls /sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901} Found 2 items -rw-r--r-- 3 as303e hdfs 302986 201

我正在尝试使用pig的OrcStorage()读取/加载目录中存在的多个Orc文件。我尝试使用glob技术,但这对我不起作用,并抛出错误,说文件不存在,因为它是可用的。请让我知道如何在pig中实现此功能

使用的示例文件:

hadoop fs -ls /sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
Found 2 items
-rw-r--r--   3 as303e hdfs     302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000000_0
-rw-r--r--   3 as303e hdfs     302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000001_0
Found 2 items
-rw-r--r--   3 as303e ksndbx28     302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000000_0
-rw-r--r--   3 as303e ksndbx28     302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000001_0
A = load '/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}' Using OrcStorage();

B= limit A 2;

DUMP B;
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Store(hdfs://localhost:8020/tmp/temp666047359/tmp808921130:org.apache.pig.impl.io.InterStorage) - scope-5 Operator Key: scope-5): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNextTuple(POStore.java:159)
        at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.runPipeline(FetchLauncher.java:161)
        at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.launchPig(FetchLauncher.java:81)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:278)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
        at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
        ... 15 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNextTuple(POLimit.java:122)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
        ... 22 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:131)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
        ... 24 more
Caused by: org.apache.hadoop.mapred.InvalidInputException: File does not exist: hdfs://localhost:8020/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:961)
        at org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.getSplits(OrcNewInputFormat.java:121)
        at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:190)
        at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:146)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:99)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:127)
        ... 25 more
使用的代码:

hadoop fs -ls /sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
Found 2 items
-rw-r--r--   3 as303e hdfs     302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000000_0
-rw-r--r--   3 as303e hdfs     302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000001_0
Found 2 items
-rw-r--r--   3 as303e ksndbx28     302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000000_0
-rw-r--r--   3 as303e ksndbx28     302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000001_0
A = load '/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}' Using OrcStorage();

B= limit A 2;

DUMP B;
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Store(hdfs://localhost:8020/tmp/temp666047359/tmp808921130:org.apache.pig.impl.io.InterStorage) - scope-5 Operator Key: scope-5): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNextTuple(POStore.java:159)
        at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.runPipeline(FetchLauncher.java:161)
        at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.launchPig(FetchLauncher.java:81)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:278)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
        at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
        ... 15 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNextTuple(POLimit.java:122)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
        ... 22 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:131)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
        ... 24 more
Caused by: org.apache.hadoop.mapred.InvalidInputException: File does not exist: hdfs://localhost:8020/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:961)
        at org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.getSplits(OrcNewInputFormat.java:121)
        at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:190)
        at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:146)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:99)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:127)
        ... 25 more
错误日志:

hadoop fs -ls /sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
Found 2 items
-rw-r--r--   3 as303e hdfs     302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000000_0
-rw-r--r--   3 as303e hdfs     302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000001_0
Found 2 items
-rw-r--r--   3 as303e ksndbx28     302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000000_0
-rw-r--r--   3 as303e ksndbx28     302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000001_0
A = load '/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}' Using OrcStorage();

B= limit A 2;

DUMP B;
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Store(hdfs://localhost:8020/tmp/temp666047359/tmp808921130:org.apache.pig.impl.io.InterStorage) - scope-5 Operator Key: scope-5): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNextTuple(POStore.java:159)
        at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.runPipeline(FetchLauncher.java:161)
        at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.launchPig(FetchLauncher.java:81)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:278)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
        at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
        ... 15 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNextTuple(POLimit.java:122)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
        ... 22 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:131)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
        ... 24 more
Caused by: org.apache.hadoop.mapred.InvalidInputException: File does not exist: hdfs://localhost:8020/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:961)
        at org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.getSplits(OrcNewInputFormat.java:121)
        at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:190)
        at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:146)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:99)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:127)
        ... 25 more
原因:org.apache.pig.backend.executionengine.ExecuteException:错误0:执行时发生异常(名称:B:Store)(hdfs://localhost:8020/tmp/temp666047359/tmp808921130:org.apache.pig.impl.io.InterStorage)-scope-5运算符键:scope-5):org.apache.pig.backend.executionengine.ExecuteException:错误0:执行时发生异常(名称:B:Limit-scope-4操作员键:scope-4):org.apache.pig.backend.executionengine.ExecutionException:错误2081:无法设置加载函数。
位于org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
位于org.apache.pig.backend.hadoop.executionengine.physicalayer.relationalOperators.POStore.getNextTuple(POStore.java:159)
位于org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.runPipeline(FetchLauncher.java:161)
位于org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.launchPig(FetchLauncher.java:81)
位于org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:278)
位于org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
位于org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
位于org.apache.pig.PigServer.storeEx(PigServer.java:1034)
…还有15个
原因:org.apache.pig.backend.executionengine.ExecutionException:错误0:执行时异常(名称:B:限制-范围-4运算符键:范围-4):org.apache.pig.backend.executionengine.ExecutionException:错误2081:无法设置加载函数。
位于org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
位于org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNextTuple(POLimit.java:122)
位于org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
…还有22个
原因:org.apache.pig.backend.executionengine.ExecuteException:错误2081:无法设置加载函数。
位于org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:131)
位于org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
…还有24个
原因:org.apache.hadoop.mapred.InvalidInputException:文件不存在:hdfs://localhost:8020/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
在org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:961)
位于org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.getSplits(OrcNewInputFormat.java:121)
位于org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:190)
位于org.apache.pig.impl.io.ReadToEndLoader。(ReadToEndLoader.java:146)
位于org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:99)
位于org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:127)
…还有25个

我对逗号也有同样的问题,你能尝试一个不使用逗号的解决方案吗:
a=load'/sandbox/sandbox28/pig\u demo/input/ORC/data\u dt=201511190*“使用OrcStorage()如果您使用逗号,我想您必须加载:
'hdfs://localhost:8020/sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900,hdfs://localhost:8020/sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901“
感谢您的建议@antonybrd不幸的是,我不能使用通配符选项数据\u dt=201511190*因为将有b我已经处理了很多源文件夹!!。我正在尝试只加载增量更改。现在我计划实现一个解决方案,该解决方案将使用其他包含多个加载语句的包装程序生成pig代码