Algebraic error in the Java runtime; "aggregate" functions on a dataset

I am working through the hadoop/pig/hive tutorials on hortonworks.com.

I did try to find a link to the tutorial, but unfortunately it only comes with the ISO image they provide; it is not actually hosted on their website.

batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;
I copied their code exactly as stated in the tutorial, and this is the output I got:

2013-06-14 14:34:37,969 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:34:37,977 [main] INFO  org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
2013-06-14 14:34:38,412 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:34:38,598 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:34:38,998 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:34:40,819 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
2013-06-14 14:34:40,827 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN,GROUP_BY
2013-06-14 14:34:41,115 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-06-14 14:34:41,160 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2013-06-14 14:34:41,201 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POJoinPackage
2013-06-14 14:34:41,213 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2013-06-14 14:34:41,213 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 map-reduce splittees.
2013-06-14 14:34:41,214 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - Merged 1 out of total 3 MR operators.
2013-06-14 14:34:41,214 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2013-06-14 14:34:41,488 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-06-14 14:34:41,551 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-06-14 14:34:41,555 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2013-06-14 14:34:41,559 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=6398990
2013-06-14 14:34:41,559 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2013-06-14 14:34:44,244 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5371236206169131677.jar
2013-06-14 14:34:49,495 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5371236206169131677.jar created
2013-06-14 14:34:49,517 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up multi store job
2013-06-14 14:34:49,529 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2013-06-14 14:34:49,530 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2013-06-14 14:34:49,530 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2013-06-14 14:34:49,755 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2013-06-14 14:34:50,144 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-06-14 14:34:50,145 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-06-14 14:34:50,256 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-06-14 14:34:50,316 [JobControl] INFO  com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2013-06-14 14:34:50,444 [JobControl] INFO  com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3]
2013-06-14 14:34:50,665 [JobControl] WARN  org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library is available
2013-06-14 14:34:50,666 [JobControl] INFO  org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library
2013-06-14 14:34:50,666 [JobControl] INFO  org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library loaded
2013-06-14 14:34:50,680 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2013-06-14 14:34:52,796 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201306140401_0021
2013-06-14 14:34:52,796 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases batting,grp_data,max_runs,runs
2013-06-14 14:34:52,796 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: batting[1,10],runs[2,7],max_runs[4,11],grp_data[3,11] C: max_runs[4,11],grp_data[3,11] R: max_runs[4,11]
2013-06-14 14:34:52,796 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://sandbox:50030/jobdetails.jsp?jobid=job_201306140401_0021
2013-06-14 14:36:01,993 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2013-06-14 14:36:04,767 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-06-14 14:36:04,768 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201306140401_0021 has failed! Stop running all dependent jobs
2013-06-14 14:36:04,768 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-06-14 14:36:05,029 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2106: Error executing an algebraic function
2013-06-14 14:36:05,030 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2013-06-14 14:36:05,042 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
1.2.0.1.3.0.0-107   0.11.1.1.3.0.0-107  mapred  2013-06-14 14:34:41 2013-06-14 14:36:05 HASH_JOIN,GROUP_BY

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201306140401_0021   batting,grp_data,max_runs,runs  MULTI_QUERY,COMBINER    Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201306140401_0021_m_000000  

Input(s):
Failed to read data from "hdfs://sandbox:8020/user/hue/batting.csv"

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201306140401_0021   ->  null,
null


2013-06-14 14:36:05,042 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2013-06-14 14:36:05,043 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias join_data
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0020/attempt_201306140401_0020_m_000000_0/work/pig_1371245677965.log
When I swap this part:
MAX(runs.runs)
for
avg(runs.runs)
I run into a completely different problem:

2013-06-14 14:38:25,694 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.1.1.3.0.0-107 (rexported) compiled May 20 2013, 03:04:35
2013-06-14 14:38:25,695 [main] INFO  org.apache.pig.Main - Logging error messages to: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log
2013-06-14 14:38:26,198 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /usr/lib/hadoop/.pigbootup not found
2013-06-14 14:38:26,438 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://sandbox:8020
2013-06-14 14:38:26,824 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: sandbox:50300
2013-06-14 14:38:28,238 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve avg using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /hadoop/mapred/taskTracker/hue/jobcache/job_201306140401_0022/attempt_201306140401_0022_m_000000_0/work/pig_1371245905690.log

Does anyone have any idea what the problem might be?

UDFs are case-sensitive, so to answer at least the second part of your question, you need to use
AVG(runs.runs)
instead of
avg(runs.runs)


Most likely, once you correct that syntax, you will get the original error you reported...
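
For reference, a corrected aggregate line, assuming the same grp_data grouping as in the question's script; built-in function names such as AVG and MAX must be written in upper case:

-- built-in names are case-sensitive: AVG resolves, avg fails with ERROR 1070
avg_runs = FOREACH grp_data GENERATE group AS grp, AVG(runs.runs) AS avg_runs;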

I ran into the same problem, with exactly the same log output, but this solution does not work, because I believe replacing MAX with AVG here defeats the whole purpose of the hortonworks.com tutorial, which is to get the playerID with the maximum number of runs for each year.

UPDATE

I finally solved it: either delete the first row in Batting.csv (the column names), or edit your Pig Latin code like this:

batting = LOAD 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
dump max_runs;
After that you should be able to complete the tutorial correctly and get the proper result.
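
Why the FILTER most likely fixes it: the header row's runs field is the literal string "R", which Pig's implicit numeric cast turns into null, so runs > 0 is false and the row is dropped before MAX ever sees a non-numeric value. If you would rather drop the header row explicitly, a minimal sketch (assuming the first column header is literally "playerID"):

-- hypothetical alternative to the runs > 0 filter: match the header row by name
runs = FILTER runs_raw BY playerID != 'playerID';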


This also seems to be caused by a "bug" in the older version of Pig which is used in the tutorial.

I believe a lot of people will run into this. I combined Eugene's solution with the original Hortonworks code, so that we get exactly the output the tutorial specifies. The following code works as stated in the tutorial and produces the exact output:

batting = LOAD 'Batting.csv' using PigStorage(',');
runs_raw = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
runs = FILTER runs_raw BY runs > 0;
grp_data = group runs by (year);
max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;
dump join_data;


Note: the line "runs = FILTER runs_raw BY runs > 0;" is extra beyond what Hortonworks provides. Thanks go to Eugene for sharing the working code, which I used to modify the original Hortonworks code and make it work.
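
If you want to keep the result on HDFS instead of just dumping it to the console, a minimal sketch (the output directory name is hypothetical and must not already exist):

-- persist the per-year maximums as CSV instead of printing them
STORE join_data INTO 'max_runs_by_year' USING PigStorage(',');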


Specify appropriate data types for playerID, year, and runs, like this:

runs = FOREACH batting GENERATE $0 as playerID:int, $1 as year:chararray, $8 as runs:int;


Nope, it should work.
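
Alternatively, the types can be declared once at load time instead of in the FOREACH; a sketch, where the column names and types past $1 are assumptions based on the Lahman Batting.csv layout the tutorial uses:

-- declare a schema up front; PigStorage drops columns beyond the schema
batting = LOAD 'Batting.csv' USING PigStorage(',')
    AS (playerID:chararray, year:int, stint:int, teamID:chararray, lgID:chararray,
        G:int, G_batting:int, AB:int, runs:int);
runs_raw = FOREACH batting GENERATE playerID, year, runs;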

No sir! AVG works. MAX does not. It might be a bug in their code!
Thanks a lot!!! How did the rest of the tutorial go for you? Did you get the cluster set up?
And your e-mail :)