Hadoop Hive MR map progress is inconsistent and periodically restarts from 0%
I have a YARN MR job (MapReduce with two EC2 instances) running on a dataset of roughly one thousand Avro records, and the map phase is behaving erratically. See the progress below. Of course I checked the logs on the ResourceManager and the NodeManagers and found nothing suspicious, but those logs are extremely verbose. What is going on there?
hive> select * from nikon where qs_cs_s_aid='VIEW' limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1352281315350_0020, Tracking URL = http://blabla.ec2.internal:8088/proxy/application_1352281315350_0020/
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=blabla.com:8032 -kill job_1352281315350_0020
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0
2012-11-07 11:14:40,976 Stage-1 map = 0%, reduce = 0%
2012-11-07 11:15:06,136 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 10.38 sec
2012-11-07 11:15:07,253 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 12.18 sec
2012-11-07 11:15:08,371 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 12.18 sec
2012-11-07 11:15:09,491 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 12.18 sec
2012-11-07 11:15:10,643 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 15.42 sec
(...)
2012-11-07 11:15:35,441 Stage-1 map = 28%, reduce = 0%, Cumulative CPU 37.77 sec
2012-11-07 11:15:36,486 Stage-1 map = 28%, reduce = 0%, Cumulative CPU 37.77 sec
here restart at 16% ?
2012-11-07 11:15:37,692 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 21.15 sec
2012-11-07 11:15:38,815 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 21.15 sec
2012-11-07 11:15:39,865 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 21.15 sec
2012-11-07 11:15:41,064 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 22.4 sec
2012-11-07 11:15:42,181 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 22.4 sec
2012-11-07 11:15:43,299 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 22.4 sec
here restart at 0% ?
2012-11-07 11:15:44,418 Stage-1 map = 0%, reduce = 0%
2012-11-07 11:16:02,076 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 6.86 sec
2012-11-07 11:16:03,193 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 6.86 sec
2012-11-07 11:16:04,259 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 8.45 sec
(...)
2012-11-07 11:16:31,291 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 35.34 sec
2012-11-07 11:16:32,414 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 37.93 sec
here restart at 11% ?
2012-11-07 11:16:33,459 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 19.53 sec
2012-11-07 11:16:34,507 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 19.53 sec
2012-11-07 11:16:35,731 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 21.47 sec
(...)
2012-11-07 11:16:46,839 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 24.14 sec
here restart at 0% ?
2012-11-07 11:16:47,939 Stage-1 map = 0%, reduce = 0%
2012-11-07 11:16:56,653 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 7.54 sec
2012-11-07 11:16:57,814 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 7.54 sec
(...)
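Needless to say, the job crashes after a while with the error: java.io.IOException: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: -56

This looks like Hadoop retrying the map tasks when they fail (by default it retries a failed task three times, on different hosts), which is how it makes jobs more fault tolerant. When a failed attempt is discarded, its progress no longer counts, which would explain the map percentage dropping back.

That feature is very useful when the failure is caused by a transient problem on a particular host (which happens more often than you would think). In your case, however, you have a genuine array index out of bounds exception that is caused by something in your Hive query. I would check the logs of the failed tasks to debug the cause. Could you also share your CREATE TABLE statement?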
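
One way to reach the failing task's stack trace faster is to switch the retries off for a debugging run, so that the job stops at the first failed attempt instead of restarting the map tasks. A minimal sketch, assuming a Hadoop 2 (YARN) cluster where the property is named mapreduce.map.maxattempts (older releases call it mapred.map.max.attempts); the value 1 is meant for debugging only:

```sql
-- Debugging sketch: give each map task a single attempt (the default is 4),
-- so the first ArrayIndexOutOfBoundsException fails the job immediately.
-- The property name assumes Hadoop 2 / MRv2; on older releases the
-- equivalent setting is mapred.map.max.attempts.
set mapreduce.map.maxattempts=1;

-- Re-run the query from the question.
select * from nikon where qs_cs_s_aid='VIEW' limit 10;
```

The full stack trace should then sit in the log of that single failed attempt, which can be reached from the Tracking URL that Hive prints. Restore the default afterwards so that transient host failures are retried again.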