Java: how to convert an old-API MapReduce job to the new MapReduce API
The following code is from Alex Holmes' Hadoop in Practice, 2nd edition. Link:
The mapper of this MapReduce job reads a list of URLs from a text file, issues an HTTP request for each one, and stores the response body in a text file. However, this code is written against the old MapReduce API, and I want to convert it to the new API. It may be as simple as changing JobConf to Job + Configuration and extending the new Mapper, but for some reason I cannot get it to work with my code. I would rather wait and post the modified code to avoid confusion, so the original code is given below. Mapper code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

public final class HttpDownloadMap
        implements Mapper<LongWritable, Text, Text, Text> {

    private final static int DEFAULT_CONNECTION_TIMEOUT_MILLIS = 5000;
    private final static int DEFAULT_READ_TIMEOUT_MILLIS = 5000;

    public static final String CONN_TIMEOUT =
            "httpdownload.connect.timeout.millis";
    public static final String READ_TIMEOUT =
            "httpdownload.read.timeout.millis";

    private int file = 0;
    private Configuration conf;
    private String jobOutputDir;
    private String taskId;
    private int connTimeoutMillis = DEFAULT_CONNECTION_TIMEOUT_MILLIS;
    private int readTimeoutMillis = DEFAULT_READ_TIMEOUT_MILLIS;

    @Override
    public void configure(JobConf job) {
        conf = job;
        jobOutputDir = job.get("mapred.output.dir");
        taskId = conf.get("mapred.task.id");
        if (conf.get(CONN_TIMEOUT) != null) {
            connTimeoutMillis = Integer.valueOf(conf.get(CONN_TIMEOUT));
        }
        if (conf.get(READ_TIMEOUT) != null) {
            readTimeoutMillis = Integer.valueOf(conf.get(READ_TIMEOUT));
        }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        Path httpDest =
                new Path(jobOutputDir, taskId + "_http_" + (file++));
        InputStream is = null;
        OutputStream os = null;
        try {
            URLConnection connection =
                    new URL(value.toString()).openConnection();
            connection.setConnectTimeout(connTimeoutMillis);
            connection.setReadTimeout(readTimeoutMillis);
            is = connection.getInputStream();
            os = FileSystem.get(conf).create(httpDest);
            IOUtils.copyBytes(is, os, conf, true);
        } finally {
            IOUtils.closeStream(is);
            IOUtils.closeStream(os);
        }
        output.collect(new Text(httpDest.toString()), value);
    }

    @Override
    public void close() throws IOException {
    }
}
Run configuration:
args[0] = "testData/input/urls.txt"
args[1] = "testData/output"
urls.txt contains:
http://www.google.com
http://www.yahoo.com
Try the following changes:
- Use the org.apache.hadoop.mapreduce package instead of the org.apache.hadoop.mapred package.
- Change OutputCollector and Reporter to Context, since the new API writes through a Context object.
- Change JobClient to Job, and JobConf to Configuration.
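Applying those changes, a sketch of the mapper ported to the new API might look like the following. This is not the book's official new-API version; the property names "mapreduce.output.fileoutputformat.outputdir" and "mapreduce.task.attempt.id" are the Hadoop 2 equivalents of "mapred.output.dir" and "mapred.task.id" (the old names still work as deprecated aliases):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mapper: extends Mapper (a class) instead of implementing the
// old mapred.Mapper interface; configure() becomes setup(), and
// OutputCollector/Reporter are replaced by a single Context object.
public final class HttpDownloadMap
        extends Mapper<LongWritable, Text, Text, Text> {

    public static final String CONN_TIMEOUT =
            "httpdownload.connect.timeout.millis";
    public static final String READ_TIMEOUT =
            "httpdownload.read.timeout.millis";
    private static final int DEFAULT_CONNECTION_TIMEOUT_MILLIS = 5000;
    private static final int DEFAULT_READ_TIMEOUT_MILLIS = 5000;

    private int file = 0;
    private Configuration conf;
    private String jobOutputDir;
    private String taskId;
    private int connTimeoutMillis = DEFAULT_CONNECTION_TIMEOUT_MILLIS;
    private int readTimeoutMillis = DEFAULT_READ_TIMEOUT_MILLIS;

    @Override
    protected void setup(Context context) {
        conf = context.getConfiguration();
        // New-API names of the old "mapred.output.dir" / "mapred.task.id".
        jobOutputDir = conf.get("mapreduce.output.fileoutputformat.outputdir");
        taskId = conf.get("mapreduce.task.attempt.id");
        if (conf.get(CONN_TIMEOUT) != null) {
            connTimeoutMillis = Integer.parseInt(conf.get(CONN_TIMEOUT));
        }
        if (conf.get(READ_TIMEOUT) != null) {
            readTimeoutMillis = Integer.parseInt(conf.get(READ_TIMEOUT));
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path httpDest =
                new Path(jobOutputDir, taskId + "_http_" + (file++));
        InputStream is = null;
        OutputStream os = null;
        try {
            URLConnection connection =
                    new URL(value.toString()).openConnection();
            connection.setConnectTimeout(connTimeoutMillis);
            connection.setReadTimeout(readTimeoutMillis);
            is = connection.getInputStream();
            os = FileSystem.get(conf).create(httpDest);
            IOUtils.copyBytes(is, os, conf, true);
        } finally {
            IOUtils.closeStream(is);
            IOUtils.closeStream(os);
        }
        // Context.write replaces OutputCollector.collect and may throw
        // InterruptedException, so the method signature must declare it.
        context.write(new Text(httpDest.toString()), value);
    }
}
```

Note that the old close() override disappears; the new API's equivalent is cleanup(Context), which only needs overriding when there is actual teardown work.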
Mapper class: Job runner class: (the code links will expire in one year.) Hi Nilay, I did modify the code based on your suggestions, but I could not make it work the way it should. Could you help me debug it? The code is uploaded to these external links. Mapper class: & Job runner class:
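Since the original job runner links are dead, here is a minimal new-API driver sketch under the same assumptions (class names HttpDownloadJob/HttpDownloadMap are illustrative; args[0] and args[1] are the input and output paths from the run configuration above). It is map-only, matching the original job's behavior of writing downloaded content directly from the mapper:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public final class HttpDownloadJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the old "new JobConf(...)" + JobClient.
        Job job = Job.getInstance(conf, "http-download");
        job.setJarByClass(HttpDownloadJob.class);
        job.setMapperClass(HttpDownloadMap.class);
        job.setNumReduceTasks(0);           // map-only job, no reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion replaces JobClient.runJob.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

One design note: job.setNumReduceTasks(0) makes the mapper's output go straight to the output directory, which matters here because the mapper also writes side files into that same directory.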