Hadoop: Pig gets stuck on a GROUP
I'm a Pig beginner (using Pig 0.10.0) and I have some simple JSON like the following:
test.json:
{
"from": "1234567890",
.....
"profile": {
"email": "me@domain.com"
.....
}
}
I'm doing some grouping/counting in Pig:
>pig -x local
with the following Pig script:
REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;
users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);
domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */
grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */
Basically, when I try to dump grouped_domain_user, Pig gets stuck, seemingly waiting for a map output to complete:
2012-05-31 17:45:22,111 [Thread-15] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:31,122 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:37,123 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:43,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:46,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:52,126 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:58,127 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:46:01,128 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
.... repeats ....
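Suggestions on why this is happening are welcome.
Thanks
UPDATE
Chris helped me figure this out. I was setting fs.default.name etc. to the correct values in pig.properties, but I also had the HADOOP_CONF_DIR environment variable set to point to my local Hadoop installation, where these same values were marked as final (<final>true</final>).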
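Great find, and much appreciated.

To mark this question as answered, and for anyone who runs into this in the future:
When running in local mode (whether you are running Pig via pig -x local or submitting a map reduce job to the local job runner), if you see the reduce phase "hang", especially if you see log entries similar to the following: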
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask -
attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/05/22 14:28:29 WARN conf.Configuration:
file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:29 WARN conf.Configuration:
file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
then your job, although started in local mode, has probably switched to "cluster" mode, because the mapred.job.tracker property is marked as "final" in $HADOOP/conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:9000</value>
<final>true</final>
</property>
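
For anyone triaging a similar hang from the console output alone, the two symptoms described above can be checked mechanically. The following is only a rough sketch: the function name diagnose and the shape of its result are made up for illustration, and the regular expressions are based solely on the log lines quoted in this thread, so other Hadoop versions may word these messages differently.

```python
import re

# Patterns taken from the log lines quoted in this thread (assumption:
# your Hadoop version words these messages the same way).
FINAL_OVERRIDE = re.compile(
    r"attempt to override final parameter: ([\w.\-]+); Ignoring"
)
WAITING_FOR_MAP = re.compile(
    r"Need another \d+ map output\(s\) where \d+ is already in progress"
)


def diagnose(log_text):
    """Summarize the two symptoms: ignored final overrides and a waiting reducer."""
    ignored = sorted(set(FINAL_OVERRIDE.findall(log_text)))
    waiting = WAITING_FOR_MAP.search(log_text) is not None
    return {
        "ignored_final_overrides": ignored,
        "reducer_waiting_for_map_output": waiting,
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python diagnose.py pig_console.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            print(diagnose(handle.read()))
```

If fs.default.name or mapred.job.tracker shows up in ignored_final_overrides while the reducer is waiting for map output, the final-property explanation above is a likely candidate, though not the only possible cause.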
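Question: as in another recent post, you haven't marked the fs.default.name and mapred.job.tracker configuration properties as final, have you?
I actually just set them in the pig.properties file. I'll check to make sure there aren't any stray Hadoop versions lingering on the path. Also, FYI, the same script runs fine on a live cluster.
Good find, Chris! That was the problem. I guess the conf files from my homebrew Hadoop installation were being read (and those parameters were set to final).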
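
To see which configuration files are marking properties as final, a small script can list them. This is a minimal sketch, assuming the files use the standard <property>/<name>/<value>/<final> layout shown in the answer and that the directory to inspect is the one HADOOP_CONF_DIR points to. The function names final_properties and scan_conf_dir are hypothetical helpers, not part of Hadoop or Pig.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path


def final_properties(conf_file):
    """Return {name: value} for each <property> marked <final>true</final>."""
    root = ET.parse(str(conf_file)).getroot()
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        final = (prop.findtext("final") or "").strip().lower()
        if name and final == "true":
            found[name] = value
    return found


def scan_conf_dir(conf_dir):
    """Map each *-site.xml file name in conf_dir to its final properties."""
    report = {}
    for path in sorted(Path(conf_dir).glob("*-site.xml")):
        finals = final_properties(path)
        if finals:
            report[path.name] = finals
    return report


if __name__ == "__main__":
    # Assumption: HADOOP_CONF_DIR is the directory whose files get picked up.
    conf_dir = os.environ.get("HADOOP_CONF_DIR")
    if not conf_dir:
        print("HADOOP_CONF_DIR is not set")
    else:
        for file_name, finals in scan_conf_dir(conf_dir).items():
            for name, value in finals.items():
                print(f"{file_name}: {name} = {value} (final)")
```

Any fs.default.name or mapred.job.tracker entry reported here cannot be overridden from pig.properties, which matches the behaviour described in this thread.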