Running Java MapReduce with the Hadoop Streaming API

I have developed my own mapper.java and reducer.java and want to run them as a Hadoop job. I have configured a single-node Hadoop cluster and run the MapReduce like this:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -file mapper.class \
-mapper 'java mapper' -file reducer.class -reducer 'java reducer' \
-input /home/hdpuser/gutenberg/* -output /home/hdpuser/gutenberg.out

packageJobJar: [mapper.class, reducer.class, /usr/hadoop/tmp/hadoop-unjar1486800984159594392/] [] /tmp/streamjob6918733297327109918.jar tmpDir=null
14/03/05 10:52:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/05 10:52:20 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/05 10:52:20 INFO mapred.FileInputFormat: Total input paths to process : 6
14/03/05 10:52:20 INFO streaming.StreamJob: getLocalDirs(): [/usr/hadoop/tmp/mapred/local]
14/03/05 10:52:20 INFO streaming.StreamJob: Running job: job_201403041518_0020
14/03/05 10:52:20 INFO streaming.StreamJob: To kill this job, run:
14/03/05 10:52:20 INFO streaming.StreamJob: /usr/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201403041518_0020
14/03/05 10:52:20 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201403041518_0020
14/03/05 10:52:21 INFO streaming.StreamJob:  map 0%  reduce 0%
14/03/05 10:52:49 INFO streaming.StreamJob:  map 100%  reduce 100%
14/03/05 10:52:49 INFO streaming.StreamJob: To kill this job, run:
14/03/05 10:52:49 INFO streaming.StreamJob: /usr/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201403041518_0020
14/03/05 10:52:49 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201403041518_0020
14/03/05 10:52:49 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201403041518_0020_m_000001
14/03/05 10:52:49 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Can you explain how to use my own map/reduce files?

Here is mapper.java (no Hadoop libs, just plain Java):
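For reference, a plain-Java streaming mapper consistent with the word-count test below could look like this (a hypothetical sketch, not necessarily the original code); the matching reducer would sum the counts per key:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// mapper.java: reads lines from stdin and emits one "word<TAB>1" pair per
// token; Hadoop streaming treats everything before the first tab as the key.
public class mapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}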

Here is how I tested the code:

$ echo "foo foo quux labs foo bar quux" | java mapper | sort -k1,1 | java reducer
bar 1
foo 3
labs 1
quux 2
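This pipeline is a faithful local simulation of what Hadoop streaming does: it feeds the input to the mapper command on stdin and pipes the sorted map output into the reducer command, with the sort standing in for the shuffle phase.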
Is there anything else I should share?

Here is how I tested the Python code:

$ echo "foo foo quux labs foo bar quux" | python mapper.py | sort -k1,1 | python reducer.py 
bar 1
foo 3
labs    1
quux    2
And here is how I executed the Hadoop job with the Python scripts; this one works:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -file mapper.py \
-mapper 'python mapper.py' -file reducer.py -reducer 'python reducer.py' \
-input /home/hdpuser/gutenberg/* -output /home/hdpuser/gutenberg.out1

packageJobJar: [mapper.py, reducer.py, /usr/hadoop/tmp/hadoop-unjar272415560722407865/] [] /tmp/streamjob3055337726170986279.jar tmpDir=null
14/03/05 11:07:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/05 11:07:35 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/05 11:07:35 INFO mapred.FileInputFormat: Total input paths to process : 6
14/03/05 11:07:36 INFO streaming.StreamJob: getLocalDirs(): [/usr/hadoop/tmp/mapred/local]
14/03/05 11:07:36 INFO streaming.StreamJob: Running job: job_201403041518_0021
14/03/05 11:07:36 INFO streaming.StreamJob: To kill this job, run:
14/03/05 11:07:36 INFO streaming.StreamJob: /usr/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201403041518_0021
14/03/05 11:07:36 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201403041518_0021
14/03/05 11:07:37 INFO streaming.StreamJob:  map 0%  reduce 0%
14/03/05 11:07:43 INFO streaming.StreamJob:  map 33%  reduce 0%
14/03/05 11:07:49 INFO streaming.StreamJob:  map 67%  reduce 0%
14/03/05 11:07:53 INFO streaming.StreamJob:  map 100%  reduce 22%
14/03/05 11:08:03 INFO streaming.StreamJob:  map 100%  reduce 100%
14/03/05 11:08:05 INFO streaming.StreamJob: Job complete: job_201403041518_0021
14/03/05 11:08:05 INFO streaming.StreamJob: Output: /home/hdpuser/gutenberg.out1
I have verified the results.

I also tried creating jar files and running those:

$ jar cvfe reducer.jar reducer reducer.class 
added manifest
adding: reducer.class(in = 1268) (out= 726)(deflated 42%)
$ jar cvfe mapper.jar mapper mapper.class 
added manifest
adding: mapper.class(in = 970) (out= 577)(deflated 40%)
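(The e flag of jar cvfe records the given class as Main-Class in the jar's manifest, which is what allows the jars to be started with java -jar below.)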

$ echo "foo foo quux labs foo bar quux" | java -jar mapper.jar | sort -k1,1 | java -jar reducer.jar
bar 1
foo 3
labs 1
quux 2
Then I used the jars with Hadoop, but with no luck:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -file mapper.jar \
-mapper 'java -jar mapper.jar' -file reducer.jar -reducer 'java -jar reducer.jar' \
-input /home/hdpuser/gutenberg/* -output /home/hdpuser/gutenberg.out3

packageJobJar: [mapper.jar, reducer.jar, /usr/hadoop/tmp/hadoop-unjar1923907702869068962/] [] /tmp/streamjob7767637153401518705.jar tmpDir=null
14/03/05 12:41:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/05 12:41:52 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/05 12:41:52 INFO mapred.FileInputFormat: Total input paths to process : 6
14/03/05 12:41:52 INFO streaming.StreamJob: getLocalDirs(): [/usr/hadoop/tmp/mapred/local]
14/03/05 12:41:52 INFO streaming.StreamJob: Running job: job_201403041518_0023
14/03/05 12:41:52 INFO streaming.StreamJob: To kill this job, run:
14/03/05 12:41:52 INFO streaming.StreamJob: /usr/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201403041518_0023
14/03/05 12:41:52 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201403041518_0023
14/03/05 12:41:53 INFO streaming.StreamJob:  map 0%  reduce 0%
14/03/05 12:42:19 INFO streaming.StreamJob:  map 100%  reduce 100%
14/03/05 12:42:19 INFO streaming.StreamJob: To kill this job, run:
14/03/05 12:42:19 INFO streaming.StreamJob: /usr/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201403041518_0023
14/03/05 12:42:19 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201403041518_0023
14/03/05 12:42:19 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201403041518_0023_m_000000
14/03/05 12:42:19 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Here is the error log:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Without knowing how your files are organized, it is hard to give a precise solution.

You can use a main method in your Java file:

// Uses the old org.apache.hadoop.mapred API; imports needed at the top of the file:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(calcAll.class);
    conf.setJobName("name");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(DoubleWritable.class); // or ObjectWritable, or whatever

    conf.setMapperClass(mapper.class);
    conf.setCombinerClass(reducer.class); // if your combiner is just a local reducer
    conf.setReducerClass(reducer.class);

    conf.setInputFormat(TextInputFormat.class); // assuming you are feeding it text
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));

    Path out1 = new Path(args[1]);
    FileOutputFormat.setOutputPath(conf, out1);

    JobClient.runJob(conf); // blocking call
}
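Note that this driver uses the old org.apache.hadoop.mapred API (JobConf/JobClient), in the style of the original snippet; you package it into a jar and submit it directly with hadoop jar and your main class, instead of going through the streaming jar.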
You can then run the whole thing with a shell script, like the one shown further below (remove out2 if you only do one job/pass).

If you need to do more work after the job has run, you can add these lines to the main method:

// the output is a set of files, merge them before continuing
// (requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
// org.apache.hadoop.fs.FileUtil, org.apache.hadoop.fs.Path and java.io.IOException)
Path out1Merged = new Path(args[2]);
Configuration config = new Configuration();
try {
    FileSystem hdfs = FileSystem.get(config);
    FileUtil.copyMerge(hdfs, out1, hdfs, out1Merged, false, config, null);
} catch (IOException e) {
    e.printStackTrace();
}
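FileUtil.copyMerge concatenates all the part-* files under out1 into the single file out1Merged; passing false keeps the source directory, and the trailing null means no separator string is appended between the merged files.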

There should be an error message in the log file; without that it is hard to tell where the problem lies.

Where can I find the log files?

When you execute java mapper, the JRE looks in its classpath for a .class or .jar file that defines the class mapper. So perhaps try compiling your source first and then invoking -file mapper.class -mapper 'java mapper' (and the same for the reducer).

Could you tell me how to do that? I don't know much Java, but I ran the equivalent in Python.

I don't understand: are you using Java or Python?

A mapper is not really a mapper unless you extend the Mapper class; the old way is to extend MapReduceBase and implement the Mapper interface. You could take a look at that.

I am not using hadoop-common, so why would I use it? Could you re-check my approach?

Just remove that library's name and replace it with the one you are actually using.
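One concrete thing to try, following the classpath remark above: since -file ships mapper.class into each task's working directory, putting that directory explicitly on the classpath may help. A hedged variant of the original command (untested; the output path is illustrative):

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -file mapper.class \
-mapper 'java -cp . mapper' -file reducer.class -reducer 'java -cp . reducer' \
-input /home/hdpuser/gutenberg/* -output /home/hdpuser/gutenberg.out2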
For completeness, here is the shell script referenced in the answer above; it compiles the driver, packages it, runs the job, and prints the merged results:

#!/usr/bin/env bash

# Export environment variable
export HADOOP_HOME=/yourPathHere

# Remove old cruft
rm ClassWithMain.jar
rm -rf MyProject_classes

# Compile the task (check your Hadoop version and Apache lib path)
javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar -d MyProject_classes ClassWithMain.java

# Abort if compilation failed (capture javac's status before packaging)
exitValue=$?
if [ $exitValue != 0 ]; then
    exit $exitValue
fi

jar -cvf ClassWithMain.jar -C MyProject_classes/ .

# File names
out1=out1-`date +%Y%m%d%H%M%S`
out2=out2-`date +%Y%m%d%H%M%S`

# Create an empty file
hadoop fs -touchz ./$out2

# Submit the 1st job
hadoop jar ClassWithMain.jar org.myorg.ClassWithMain /data ./$out1 ./$out1/merged ./$out2

# Display the results
hadoop fs -cat ./$out1/merged
hadoop fs -cat ./$out2

# Cleanup
hadoop fs -rm -r ./out*