Hadoop Pig gets stuck on a GROUP

I'm a Pig beginner (using Pig 0.10.0) and I have some simple JSON that looks like this:

test.json:

{
  "from": "1234567890",
  .....
  "profile": {
      "email": "me@domain.com"
      .....
  }
}
I'm doing some grouping/counting on it in Pig:

>pig -x local
using the following Pig script:

REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;

users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);

domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */

grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */
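
For context, the grouping is meant to feed a per-domain count. A minimal sketch of that follow-up step (my own addition, assuming the standard Pig COUNT built-in and the grouped_domain_user relation produced above):

-- hypothetical continuation: count user ids per email domain
domain_counts = FOREACH grouped_domain_user GENERATE group AS email, COUNT(domain_user) AS num_users;
DUMP domain_counts;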
Basically, when I try to DUMP grouped_domain_user, Pig gets stuck and appears to be waiting for the map output to complete:

2012-05-31 17:45:22,111 [Thread-15] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:31,122 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:37,123 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:43,124 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:46,124 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:52,126 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:45:58,127 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
2012-05-31 17:46:01,128 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > copy > 
.... repeats ....
Suggestions as to why this is happening would be welcome.

Thanks

UPDATE

Chris helped me figure this out. I was setting fs.default.name etc. to the correct values in pig.properties, but I was also setting the HADOOP_CONF_DIR environment variable to point at a local Hadoop installation whose config files had those same values marked as final (true).
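
For anyone checking for the same situation, here is a rough sketch of how to see which configuration a local run is actually picking up (my own illustration; the grep options assume GNU grep, and the file names are the standard Hadoop 1.x ones, not paths from the original post):

# Which conf directory, if any, is the environment pointing Hadoop/Pig at?
echo "$HADOOP_CONF_DIR"
# Are fs.default.name / mapred.job.tracker declared final in that conf?
grep -A 2 'mapred.job.tracker' "$HADOOP_CONF_DIR/mapred-site.xml"
grep -A 2 'fs.default.name' "$HADOOP_CONF_DIR/core-site.xml"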


Very nice find, thanks a lot.

To mark this question as answered, and for anyone who runs into this problem in the future:

When running in local mode (whether running Pig via pig -x local, or submitting a map-reduce job to the local job runner), if you see the reduce phase "hang", and especially if you see log entries similar to the following:

2012-05-31 17:45:22,129 [Thread-15] INFO  org.apache.hadoop.mapred.ReduceTask - 
      attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/05/22 14:28:29 WARN conf.Configuration: 
    file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: fs.default.name;  Ignoring.
12/05/22 14:28:29 WARN conf.Configuration: 
    file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker;  Ignoring.
then your job, although started in local mode, has probably switched itself over to "cluster" mode, because the mapred.job.tracker property in $HADOOP/conf/mapred-site.xml is marked as "final":

<property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9000</value>
    <final>true</final>
</property>
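
One way out (my own sketch, not part of the original answer) is to make sure the configuration your local runs actually pick up leaves these properties overridable and set to local-mode values, for example:

<!-- illustrative local-mode settings; note there is no <final>true</final>, so Pig can still override them -->
<property>
    <name>mapred.job.tracker</name>
    <value>local</value>
</property>
<property>
    <name>fs.default.name</name>
    <value>file:///</value>
</property>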

Question - like a recent post, you don't have the fs.default.name / mapred.job.tracker configuration properties marked as final, do you?

I actually just set them in the pig.properties file. I'll check to make sure there isn't some other Hadoop install lurking on the path. Also, for reference, the same script runs just fine on a live cluster.

Nice find Chris! That was exactly the problem - I think the conf files of my homebrew Hadoop install were being read (and those parameters were marked as final).
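
As a hedged illustration of that "nothing lurking on the path" check (plain shell commands, not from the original thread):

# List every hadoop executable visible on the PATH
which -a hadoop
# Show where the environment says Hadoop lives, if anywhere
echo "$HADOOP_HOME"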
fs.default.name
mapred.job.tracker
配置属性标记为final,是吗?-我实际上只是在pig.properties文件中设置了它们。我会检查以确保没有任何hadoop版本在路径中徘徊。另外,供参考,相同的脚本在在一个实时集群上运行g非常好。很好找到Chris!这就是问题所在。我想我的自制hadoop安装conf文件被读取了(那些参数被设置为final)。