Hadoop: reading files from the distributed cache

hadoop mapreduce

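I have stored many files in the distributed cache, one per user id. I want to attach one specific file to a specific reduce task: the file that corresponds to a given user id, which will be the reducer's key. I can't do this, because I read files from the distributed cache in the configure method, which runs before the reduce method of the reduce class. The key passed to reduce is therefore not available inside configure, so I can't read only the file I want. Please help.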

class reduce {

    void configure(args)
    {
        /* I can get a particular file from the Path[] here.
           I want to select the file corresponding to the key of the reduce
           method and pass its contents to the reduce method. I am not able to
           do this because I can't access the key of the reduce method here. */
    }

    void reduce(args)
    {
    }

}

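One solution is to assign the Path arrays from the DistributedCache to class variables in the configure step, as described in the DistributedCache documentation. Of course, replace the map code with reduce code. Because class variables are visible inside reduce, the matching file can then be chosen there, once the key is known.

This uses the old API, which looks like the one your code is using: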

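 import java.io.IOException;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.hadoop.mapred.Mapper;
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reporter;

 // K and V are placeholders for the job's concrete key and value types.
 public static class MapClass extends MapReduceBase
 implements Mapper<K, V, K, V> {

   private Path[] localArchives;
   private Path[] localFiles;

   public void configure(JobConf job) {
     // Get the cached archives/files. These calls throw IOException,
     // which configure() cannot declare, so rethrow it unchecked.
     try {
       localArchives = DistributedCache.getLocalCacheArchives(job);
       localFiles = DistributedCache.getLocalCacheFiles(job);
     } catch (IOException e) {
       throw new RuntimeException(e);
     }
   }

   public void map(K key, V value,
                   OutputCollector<K, V> output, Reporter reporter)
   throws IOException {
     // Use data from the cached archives/files here
     // ...
     // ...
     output.collect(key, value);
   }
 }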
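
To connect this back to the question, here is a minimal sketch of how the paths saved in configure could be matched to the key inside reduce. It is plain Java with no Hadoop dependency, and it rests on assumptions not stated in the question: the class name CacheFileIndex and its methods are hypothetical, and each cached file is assumed to be named after its user id, optionally followed by an extension.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps a user id to the local path of its cached file.
final class CacheFileIndex {
    private final Map<String, String> pathByUserId = new HashMap<>();

    // Build once in configure(), passing the string form of each Path
    // returned by DistributedCache.getLocalCacheFiles(job).
    CacheFileIndex(String[] localPaths) {
        for (String path : localPaths) {
            pathByUserId.put(userIdOf(path), path);
        }
    }

    // Assumption: the file is named "<userId>" or "<userId>.<extension>".
    static String userIdOf(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    // Call from reduce() with key.toString(); returns null if no file matches.
    String pathFor(String userId) {
        return pathByUserId.get(userId);
    }
}
```

Under these assumptions, the reducer would keep a CacheFileIndex as a class variable, fill it in configure from the Path array (calling toString() on each Path), and in reduce call pathFor(key.toString()) to open only the file that belongs to the current key. If the real file names follow a different scheme, only userIdOf needs to change.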
