Hadoop DistributedCache is deprecated - what is the preferred API?

java hadoop mapreduce

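My map tasks need some configuration data, which I would like to distribute via the distributed cache.

The Hadoop documentation shows the usage of the DistributedCache class, roughly as follows: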

// In the driver
JobConf conf = new JobConf(getConf(), WordCount.class);
...
DistributedCache.addCacheFile(new Path(filename).toUri(), conf); 

// In the mapper
Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
...

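However, DistributedCache is marked as deprecated in Hadoop 2.2.0.

What is the new preferred way to achieve this? Is there an up-to-date example or tutorial covering this API?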

The APIs for the distributed cache can be found in the Job class itself. Check its documentation; the code should look something like:

Job job = new Job();
...
job.addCacheFile(new Path(filename).toUri());

In the mapper code:

Path[] localPaths = context.getLocalCacheFiles();
...

The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class:

   Job.addCacheFile()

Unfortunately, there are not yet many comprehensive tutorial-style examples of this.

To expand on @jtravaglini's answer, the preferred way of using the distributed cache for YARN/MapReduce 2 is as follows:

In the driver, use Job.addCacheFile().

In the mapper/reducer, override the setup(Context context) method:

@Override
protected void setup(
        Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    if (context.getCacheFiles() != null
            && context.getCacheFiles().length > 0) {

        File some_file = new File("./some");
        File other_file = new File("./other");

        // Do things to these two files, like read them
        // or parse as JSON or whatever.
    }
    super.setup(context);
}

I had the same problem. Not only is DistributedCache deprecated, but so are getLocalCacheFiles and "new Job". What worked for me is the following:

Driver:

Configuration conf = getConf();
Job job = Job.getInstance(conf);
...
job.addCacheFile(new Path(filename).toUri());

In the Mapper/Reducer setup:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    super.setup(context);

    URI[] files = context.getCacheFiles(); // may be null if no cache files were added

    Path file1path = new Path(files[0]);
    ...
}
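Because getCacheFiles() can return null, indexing files[0] directly may fail with an unhelpful NullPointerException. The following is a minimal sketch of a guard, written against plain java.net.URI so that it runs without Hadoop on the classpath; the class name CacheFileGuard and the method requireCacheFile are hypothetical and not part of any Hadoop API.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop: validates the array that
// context.getCacheFiles() returned before an element is used.
final class CacheFileGuard {

    private CacheFileGuard() {
    }

    /** Returns files[index], or throws with a descriptive message if it is absent. */
    static URI requireCacheFile(URI[] files, int index) {
        if (files == null || index < 0 || index >= files.length) {
            int count = (files == null) ? 0 : files.length;
            throw new IllegalStateException(
                    "Expected a cache file at index " + index
                            + " but the job provides " + count);
        }
        return files[index];
    }
}
```

In setup(), this could be used as URI first = CacheFileGuard.requireCacheFile(context.getCacheFiles(), 0); so that a missing -files option or addCacheFile() call is reported clearly.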

I did not use job.addCacheFile(). Instead, I used the -files option as before, for example "-files /path/to/myfile.txt#myfile". Then, in the mapper or reducer code, I use the following method:

/**
 * This method can be used with local execution or HDFS execution. 
 * 
 * @param context
 * @param symLink
 * @param throwExceptionIfNotFound
 * @return
 * @throws IOException
 */
public static File findDistributedFileBySymlink(JobContext context, String symLink, boolean throwExceptionIfNotFound) throws IOException
{
    URI[] uris = context.getCacheFiles();
    if(uris==null||uris.length==0)
    {
        if(throwExceptionIfNotFound)
            throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
        return null;
    }
    URI symlinkUri = null;
    for(URI uri: uris)
    {
        if(symLink.equals(uri.getFragment()))
        {
            symlinkUri = uri;
            break;
        }
    }   
    if(symlinkUri==null)
    {
        if(throwExceptionIfNotFound)
            throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
        return null;
    }
    //if we run this locally the file system URI scheme will be "file" otherwise it should be a symlink
    return "file".equalsIgnoreCase(FileSystem.get(context.getConfiguration()).getScheme())?(new File(symlinkUri.getPath())):new File(symLink);

}

Then in the mapper/reducer:

@Override
protected void setup(Context context) throws IOException, InterruptedException
{
    super.setup(context);

    File file = HadoopUtils.findDistributedFileBySymlink(context,"myfile",true);
    ... do work ...
}

Note that if I use "-files /path/to/myfile.txt" directly, then I need to use "myfile.txt" to access the file, since that is the default symlink name.
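The naming rule described here (the part after '#' if present, otherwise the file name) can be sketched in plain Java. This is only an illustration of the rule as stated in this thread, not Hadoop's own code; the class CacheLinkNames and the method linkName are hypothetical names, and the sketch assumes a hierarchical URI with a non-null path.

```java
import java.net.URI;

// Hypothetical helper illustrating how the local symlink name relates
// to a cache URI: the fragment wins, else the last path segment is used.
final class CacheLinkNames {

    private CacheLinkNames() {
    }

    static String linkName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        // Assumption: the URI is hierarchical, so getPath() is not null.
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // prints "myfile"
        System.out.println(linkName(URI.create("/path/to/myfile.txt#myfile")));
        // prints "myfile.txt"
        System.out.println(linkName(URI.create("/path/to/myfile.txt")));
    }
}
```

A task could then open new File(CacheLinkNames.linkName(uri)) relative to its working directory, provided the framework has created the symlink.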

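None of the solutions mentioned worked completely for me. This may be because the Hadoop version keeps changing; I am using Hadoop 2.6.4. DistributedCache is indeed deprecated, so I did not want to use it. Several posts suggest using addCacheFile(), but its usage has changed a bit. Here is how it worked for me: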

job.addCacheFile(new URI("hdfs://X.X.X.X:9000/EnglishStop.txt#EnglishStop.txt"));

Here X.X.X.X can be the master IP address or localhost. EnglishStop.txt is stored in HDFS at the / location.

hadoop fs -ls /

The output is:

-rw-r--r--   3 centos supergroup       1833 2016-03-12 20:24 /EnglishStop.txt
drwxr-xr-x   - centos supergroup          0 2016-03-12 19:46 /test

Funny but convenient: #EnglishStop.txt means that we can now access the file as "EnglishStop.txt" in the mapper. Here is the code for that:

public void setup(Context context) throws IOException, InterruptedException
{
    // "EnglishStop.txt" is the symlink name given after '#' in the cache URI
    File stopwordFile = new File("EnglishStop.txt");
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(stopwordFile)))) {
        String stopWord;
        while ((stopWord = reader.readLine()) != null) {
            // stopWord is a word read from the cache
        }
    }
}
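The loop above only reads the words; a mapper usually wants to keep them for lookups in map(). A minimal sketch of that step follows, using only the JDK so that it can be tried outside a cluster. The class StopWordLoader is a hypothetical name, and the sketch assumes a UTF-8 file with one word per line. Note that it uses java.nio.file.Path, which is unrelated to Hadoop's org.apache.hadoop.fs.Path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: loads a localized cache file into a set of words.
final class StopWordLoader {

    private StopWordLoader() {
    }

    /** Reads one word per line, trimming whitespace and skipping blank lines. */
    static Set<String> load(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader =
                Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

Inside setup(), the result could be stored in a field, for example stopWords = StopWordLoader.load(java.nio.file.Paths.get("EnglishStop.txt"));, and map() could then skip any token for which stopWords.contains(token) is true.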

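This worked for me: you can read lines from the file stored in HDFS.

Thanks. I assume I therefore need to use the newer mapreduce API rather than mapred, otherwise the JobContext object is not provided to the mapper...

I think getLocalCacheFiles() is deprecated, but getCacheFiles() is fine; it returns URIs rather than Paths, though.

Nice! This is a much cleaner and simpler API than using DistributedCache.

@DNA I think getLocalCacheFiles() and getCacheFiles() are different. You can check my question. If you want to access the localized files but do not want to use the deprecated API, you can open them directly by file name (the technique behind this is the symbolic link).

But what if we use a framework (such as Cascading) that creates the jobs? We can only pass the JobConf to the Cascading framework. What is the alternative to DistributedCache in this case?

I don't know how to retrieve the cache files added with Job.addCacheFile(URI). Using the old way (context.getCacheFiles()) does not work for me, because the files are null.

Where is this documented?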