Java: How to convert old-API MapReduce job code to the new MapReduce API

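The code below is from Alex Holmes's Hadoop in Practice, 2nd edition: Link:

The mapper in this MapReduce code reads a list of URLs from a text file, sends an HTTP request to each one, and stores the response body in a text file.

However, the code is written against the old MapReduce API, and I want to convert it to the new MapReduce API. It may be as simple as changing JobConf to Job + Configuration and extending the new Mapper, but for some reason I can't get it to work with my code.

I'd rather hold off on posting my modified code to avoid confusion, but the original code is as follows:

Mapper code: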

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

public final class HttpDownloadMap
    implements Mapper<LongWritable, Text, Text, Text> {
  private int file = 0;
  private Configuration conf;
  private String jobOutputDir;
  private String taskId;
  private int connTimeoutMillis =
      DEFAULT_CONNECTION_TIMEOUT_MILLIS;
  private int readTimeoutMillis = DEFAULT_READ_TIMEOUT_MILLIS;
  private final static int DEFAULT_CONNECTION_TIMEOUT_MILLIS = 5000;
  private final static int DEFAULT_READ_TIMEOUT_MILLIS = 5000;

  public static final String CONN_TIMEOUT =
      "httpdownload.connect.timeout.millis";

  public static final String READ_TIMEOUT =
      "httpdownload.read.timeout.millis";

  @Override
  public void configure(JobConf job) {
    conf = job;
    jobOutputDir = job.get("mapred.output.dir");
    taskId = conf.get("mapred.task.id");

    if (conf.get(CONN_TIMEOUT) != null) {
      connTimeoutMillis = Integer.valueOf(conf.get(CONN_TIMEOUT));
    }
    if (conf.get(READ_TIMEOUT) != null) {
      readTimeoutMillis = Integer.valueOf(conf.get(READ_TIMEOUT));
    }
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    Path httpDest =
        new Path(jobOutputDir, taskId + "_http_" + (file++));

    InputStream is = null;
    OutputStream os = null;
    try {
      URLConnection connection =
          new URL(value.toString()).openConnection();
      connection.setConnectTimeout(connTimeoutMillis);
      connection.setReadTimeout(readTimeoutMillis);
      is = connection.getInputStream();

      os = FileSystem.get(conf).create(httpDest);

      IOUtils.copyBytes(is, os, conf, true);
    } finally {
      IOUtils.closeStream(is);
      IOUtils.closeStream(os);
    }

    output.collect(new Text(httpDest.toString()), value);
  }

  @Override
  public void close() throws IOException {
  }
}

Run configuration:

args[0] = "testData/input/urls.txt"
args[1] = "testData/output"

urls.txt contains:

http://www.google.com 
http://www.yahoo.com

Try the following changes:

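  • Import the org.apache.hadoop.mapreduce package instead of the org.apache.hadoop.mapred package.

  • Replace the old OutputCollector and Reporter with Context, because the new API uses the Context object for writing output.

  • Change JobClient to Job, and JobConf to Configuration.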

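  • Mapper class: Job Runner class: (the code links will expire in 1 year.)

  • Hi Nilay, I did modify the code following your suggestion, but I can't get it to work as it should. Could you help me debug it? The code is uploaded to these external links. Mapper class: & Job runner class: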