
Python: What's wrong with my boto elastic mapreduce jar jobflow parameters?

I am using the boto library to create a job flow in Amazon's Elastic MapReduce web service (EMR). The following code should create a step:

from boto.emr.step import JarStep

# run_id is defined elsewhere; it identifies the current run in the bucket paths
step2 = JarStep(name='Find similar items',
                jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT.jar',
                main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
                           's3n://bucket/output/' + run_id + '/similiar_items/',
                           'SIMILARITY_PEARSON_CORRELATION'
                          ])
When I run the job flow, it always fails with the following error:

java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/JobContext
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.JobContext
This is the line from the EMR logs that invokes the Java code:

2011-01-24T22:18:54.491Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java \
-cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.18-core.jar:/home/hadoop/hadoop-0.18-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* \
-Xmx1000m \
-Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 \
-Dhadoop.log.file=syslog \
-Dhadoop.home.dir=/home/hadoop \
-Dhadoop.id.str=hadoop \
-Dhadoop.root.logger=INFO,DRFA \
-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp \
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 \
org.apache.hadoop.mapred.JobShell \
/mnt/var/lib/hadoop/steps/3/mahout-core-0.5-SNAPSHOT.jar \
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
s3n://..../output/job_2011-01-24_23:09:29/aggregate_watched/ \
s3n://..../output/job_2011-01-24_23:09:29/similiar_items/ \
SIMILARITY_PEARSON_CORRELATION
What is wrong with the parameters? The Java class definition can be found here:


I found the solution to the problem:

  • You need to specify hadoop version 0.20 in the jobflow parameters (Hadoop 0.18 predates the org.apache.hadoop.mapreduce API that Mahout is compiled against, which is why the JobContext class cannot be found)
  • You need to run the JAR step with mahout-core-0.5-SNAPSHOT-job.jar, not mahout-core-0.5-SNAPSHOT.jar
  • If you have an additional streaming step in your job flow, you need to fix a bug in boto (a sketch of the patched method follows the code below):
  • Open boto/emr/step.py
  • Change line 138 to "return '/home/hadoop/contrib/streaming/hadoop-streaming.jar'"
  • Save and reinstall boto
  • Here is how the run_jobflow function should be invoked so it works with Mahout:

    jobid = emr_conn.run_jobflow(name=name,
                                 log_uri='s3n://' + main_bucket_name + '/emr-logging/',
                                 enable_debugging=1,
                                 hadoop_version='0.20',
                                 steps=[step1, step2])
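
    For context, the boto fix from the list above boils down to returning the unversioned streaming jar path. A rough sketch of what the patched method in boto/emr/step.py might look like (the surrounding class and method layout is an assumption, not copied from boto's source; only the return value comes from the fix described above):

    # boto/emr/step.py, around line 138 (class/method layout assumed)
    class StreamingStep(Step):
        def jar(self):
            # Return the unversioned streaming jar so the step also works on
            # Hadoop 0.20 clusters, where no 0.18-specific jar exists
            return '/home/hadoop/contrib/streaming/hadoop-streaming.jar'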

    The boto fix described in step 2 above (i.e. using the unversioned hadoop-streaming.jar file) has been incorporated into github master in this commit:


    For reference, here is how to do this from boto:

    import boto.emr.connection as botocon
    import boto.emr.step as step

    con = botocon.EmrConnection(aws_access_key_id='', aws_secret_access_key='')

    step = step.JarStep(name='Find similar items',
                        jar='s3://mahout-core-0.6-job.jar',
                        main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                        action_on_failure='CANCEL_AND_WAIT',
                        step_args=['--input', 's3://',
                                   '--output', 's3://',
                                   '--similarityClassname', 'SIMILARITY_PEARSON_CORRELATION'])

    con.add_jobflow_steps('jflow', [step])
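
    To verify the step was added, one might poll the job flow state afterwards; a small sketch using boto's describe_jobflow ('jflow' is the same placeholder job flow id as above):

    # Poll the job flow and print its current state (e.g. RUNNING, WAITING)
    status = con.describe_jobflow('jflow')
    print status.state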
    

    Obviously, you need to upload mahout-core-0.6-job.jar to an accessible S3 location, and the input and output paths must be accessible as well.
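
    If the jar is not in S3 yet, a minimal upload sketch using boto's S3 API might look like this (the bucket and key names are placeholders, not from the original answer):

    import boto

    # Connect using credentials from the environment / boto config
    s3 = boto.connect_s3()

    # 'my-bucket' is a placeholder; use a bucket the job flow can read
    bucket = s3.get_bucket('my-bucket')
    key = bucket.new_key('mahout-core-0.6-job.jar')
    key.set_contents_from_filename('mahout-core-0.6-job.jar')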

    lol. It took me a while to convince myself that you weren't making this up.

    @t.E.D: I agree with you. I guess I'm getting old... whenever I read "jar" I want to answer as if this were cooking.stackexchange.com.

    Thanks for coming back and posting your answer; it may help someone else who runs into the same problem.

    If this really is the answer to your question, you should also accept it (click the checkmark next to the score). I know it can feel awkward to give yourself rep this way, but it tells the SO system that a solution was found.