Apache pig 无法打开别名的迭代器_Apache Pig_Pig Udf

Apache pig 无法打开别名的迭代器

apache-pig

Apache pig 无法打开别名的迭代器,apache-pig,pig-udf,Apache Pig,Pig Udf,我知道这是重复最多的问题之一。我几乎找遍了所有地方，没有任何资源可以解决我面临的问题。下面是我的问题陈述的简化版本。但在实际数据中，数据并不复杂，所以我必须使用UDF 我的输入文件：input.txt NotNeeded1,NotNeeded11;Needed1 NotNeeded2,NotNeeded22;Needed2 我希望输出是 Needed1 Needed2 所以，我在写下面的UDF Java代码： package com.company.pig; import java.io

我知道这是重复最多的问题之一。我几乎找遍了所有地方，没有任何资源可以解决我面临的问题。下面是我的问题陈述的简化版本。但在实际数据中，数据并不复杂，所以我必须使用UDF

我的输入文件：input.txt

NotNeeded1,NotNeeded11;Needed1
NotNeeded2,NotNeeded22;Needed2

我希望输出是

Needed1
Needed2

所以，我在写下面的UDF Java代码：

package com.company.pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class myudf extends EvalFunc<String>{
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        String s = (String)input.get(0);
        String str = s.split("\\,")[1];
        String str1 = str.split("\\;")[1];
        return str1;
    }
}

下面是我的猪壳代码

grunt> REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
grunt> DEFINE myudf com.company.pig.myudf;
grunt> data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(',');
grunt> extract = FOREACH data GENERATE myudf($1);
grunt> DUMP extract;

我得到以下错误：

2017-05-15 15:58:15,493 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2017-05-15 15:58:15,577 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2017-05-15 15:58:15,659 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2017-05-15 15:58:15,774 [main] INFO  org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2017-05-15 15:58:15,865 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-05-15 15:58:15,923 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-05-15 15:58:15,923 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-05-15 15:58:16,184 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:16,196 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:16,396 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:16,576 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2017-05-15 15:58:16,580 [main] WARN  org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:16,584 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2017-05-15 15:58:16,588 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2017-05-15 15:58:17,258 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/pig/rollupreg_extract-jar-with-dependencies.jar to DistributedCache through /tmp/temp-1119775568/tmp-858482998/rollupreg_extract-jar-with-dependencies.jar
2017-05-15 15:58:17,276 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2017-05-15 15:58:17,294 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2017-05-15 15:58:17,295 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2017-05-15 15:58:17,295 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2017-05-15 15:58:17,354 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2017-05-15 15:58:17,510 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:17,511 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:17,511 [JobControl] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:17,753 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-05-15 15:58:17,820 [JobControl] INFO  org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
2017-05-15 15:58:17,830 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-05-15 15:58:17,830 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2017-05-15 15:58:17,884 [JobControl] INFO  com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2017-05-15 15:58:17,889 [JobControl] INFO  com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 7a4b57bedce694048432dd5bf5b90a6c8ccdba80]
2017-05-15 15:58:17,922 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2017-05-15 15:58:18,525 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2017-05-15 15:58:18,692 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1494853652295_0023
2017-05-15 15:58:18,879 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2017-05-15 15:58:18,973 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1494853652295_0023
2017-05-15 15:58:19,029 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1494853652295_0023/
2017-05-15 15:58:19,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1494853652295_0023
2017-05-15 15:58:19,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases data,extract
2017-05-15 15:58:19,030 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[2,7],extract[3,10] C:  R:
2017-05-15 15:58:19,044 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2017-05-15 15:58:19,044 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1494853652295_0023]
2017-05-15 15:58:29,156 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-05-15 15:58:29,156 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1494853652295_0023 has failed! Stop running all dependent jobs
2017-05-15 15:58:29,157 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2017-05-15 15:58:29,790 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:29,791 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:29,793 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,311 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
2017-05-15 15:58:30,312 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at sandbox.hortonworks.com/172.17.0.2:8050
2017-05-15 15:58:30,313 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at sandbox.hortonworks.com/172.17.0.2:10200
2017-05-15 15:58:30,465 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2017-05-15 15:58:30,467 [main] WARN  org.apache.pig.tools.pigstats.ScriptState - unable to read pigs manifest file
2017-05-15 15:58:30,472 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
2.7.3.2.5.0.0-1245              root    2017-05-15 15:58:16     2017-05-15 15:58:30     UNKNOWN

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1494853652295_0023  data,extract    MAP_ONLY        Message: Job failed!    hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225,

Input(s):
Failed to read data from "/pig_hdfs/input.txt"

Output(s):
Failed to produce result in "hdfs://sandbox.hortonworks.com:8020/tmp/temp-1119775568/tmp-1619300225"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1494853652295_0023


2017-05-15 15:58:30,472 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2017-05-15 15:58:30,499 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias extract
Details at logfile: /pig/pig_1494863836458.log

我知道这是抱怨

Failed to read data from "/pig_hdfs/input.txt"

但我相信这不是真正的问题。如果我不使用udf并直接转储数据，我就会得到输出。因此，这不是问题所在。

首先，您不需要自定义项来获得所需的输出。您可以在load语句中使用分号作为分隔符并获得所需的列

data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' USING PigStorage(';');
extract = FOREACH data GENERATE $1;
DUMP extract;

如果坚持使用自定义项，则必须将记录加载到单个字段中，然后使用自定义项。此外，自定义项不正确。应使用“；”拆分字符串s作为分隔符，从pig脚本传递

String s = (String)input.get(0);
String str1 = s.split("\\;")[1];

在pig脚本中，需要将整个记录加载到1个字段中，并在字段$0上使用udf

REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
DEFINE myudf com.company.pig.myudf;
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' AS (f1:chararray);
extract = FOREACH data GENERATE myudf($0);
DUMP extract;

我做到了。但猪也提出了同样的问题。如果我只是注册jar，然后在加载数据后转储数据——尽管没有使用UDFYou有2列，如果使用；那你不觉得吗会把它分成两列吗？PIG不是这里的问题。请检查您的输入和脚本。另外，如果您使用UDF，您是否更改了java UDF和pigscript？如果您不使用UDF，注册UDF有什么意义？我的数据很复杂。为了简单起见，我制作了一些样本数据。当我发出转储数据时，没有寄存器，它正确地加载了数据。在我注册jar的那一刻，我所有的DUMP和STORE命令都失败了。我找到了解决办法。。由于UDF jar，它失败了。这是因为org.apache.pig v0.12依赖项不兼容。我尝试了同一依赖项的其他版本，但没有成功。最后，答案中的依赖性解决了这个问题

REGISTER /pig/rollupreg_extract-jar-with-dependencies.jar;
DEFINE myudf com.company.pig.myudf;
data = LOAD 'hdfs://sandbox.hortonworks.com:8020/pig_hdfs/input.txt' AS (f1:chararray);
extract = FOREACH data GENERATE myudf($0);
DUMP extract;