Hadoop WordCount: common words across files

hadoop mapreduce

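I have managed to run the Hadoop wordcount example in non-distributed mode. I get the output in a file named "part-00000", and I can see that it lists all the words from all the input files combined.

After tracing the wordcount code, I can see that it takes each line and splits it into words on whitespace.

I am trying to think of a way to list only the words that occur in more than one file, together with their number of occurrences. Can this be achieved in Map/Reduce?

-Added- Are these changes appropriate?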

    // changes in the parameters here
    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
        // These are the original lines; I am not using them but left them here...
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // My changes are here too
        private Text outvalue = new Text();
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        private String filename = fileSplit.getPath().getName();

        public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());

                // And here
                outvalue.set(filename);
                output.collect(word, outvalue);
            }
        }
    }

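You could amend the mapper to output the word as the key and, as the value, a Text holding the name of the file the word came from. Then, in the reducer, you just need to de-duplicate the file names and output the entry only if the word appears in more than one file.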
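The reducer logic described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under assumptions: the class `CommonWordReducer`, the method `reduce`, the sample file names, and the "word, tab, count" output format are all hypothetical, and the count relies on the mapper emitting one file name per occurrence.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Plain-Java stand-in for the reduce step; no Hadoop classes are used.
final class CommonWordReducer {

    // Hypothetical helper: receives every file name emitted for one word
    // (one entry per occurrence) and returns "word<TAB>count" only when the
    // word was seen in more than one distinct file.
    static Optional<String> reduce(String word, Iterable<String> fileNames) {
        Set<String> distinctFiles = new HashSet<>();
        int occurrences = 0;
        for (String fileName : fileNames) {
            distinctFiles.add(fileName); // de-duplicate the source files
            occurrences++;               // each value is one occurrence
        }
        if (distinctFiles.size() > 1) {
            return Optional.of(word + "\t" + occurrences);
        }
        return Optional.empty(); // word is confined to a single file
    }

    public static void main(String[] args) {
        // Made-up sample data: "hadoop" occurs in two files, "rare" in one.
        System.out.println(reduce("hadoop", Arrays.asList("a.txt", "b.txt", "a.txt")));
        System.out.println(reduce("rare", Arrays.asList("a.txt", "a.txt")));
    }
}
```

In a real job the same loop would sit inside the reducer's reduce method, iterating over the Text values and collecting the result instead of returning it.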

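Getting the name of the file being processed depends on whether you are using the old or the new API (the mapred or mapreduce package names, respectively). I know that for the new API you can extract the mapper's input split from the Context object (and then probably cast the InputSplit to a FileSplit, assuming you are using TextInputFormat). For the old API I have never tried it, but apparently you can use a configuration property named map.input.file.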
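Whichever API supplies the path, the mapper usually needs only the short file name. Here is a minimal plain-Java sketch, assuming the value (for example the one read from map.input.file) is a full path or URI without characters that need escaping. The class `InputFileName`, the method `baseName`, and the sample URIs are hypothetical.

```java
import java.net.URI;

// Plain-Java illustration; no Hadoop classes are used.
final class InputFileName {

    // Hypothetical helper: returns the last path segment of a path or URI
    // string. URI.create throws IllegalArgumentException for illegal
    // characters such as unescaped spaces.
    static String baseName(String inputFile) {
        String path = URI.create(inputFile).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Made-up example values.
        System.out.println(baseName("hdfs://namenode:9000/user/demo/input/file01.txt"));
        System.out.println(baseName("file:/tmp/input/file02.txt"));
    }
}
```

Inside a job, the getPath().getName() call already used in the question's code does the same thing on a FileSplit.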

Introducing a Combiner to eliminate repeated occurrences of a word coming from the same mapper is also a good option.
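The de-duplication a Combiner would perform can be sketched in plain Java. This is an illustration only: the class `FileNameCombiner` and the method `combine` are hypothetical names and no Hadoop classes are involved. The operation is idempotent, which matters because Hadoop may run a combiner zero, one, or several times.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Plain-Java stand-in for a combiner; no Hadoop classes are used.
final class FileNameCombiner {

    // Hypothetical helper: collapses the file names one mapper emitted for a
    // single word into the distinct names, keeping first-seen order.
    static List<String> combine(Iterable<String> fileNames) {
        LinkedHashSet<String> distinct = new LinkedHashSet<>();
        for (String fileName : fileNames) {
            distinct.add(fileName);
        }
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        // A mapper that saw the same word three times in a.txt (made-up data).
        System.out.println(combine(Arrays.asList("a.txt", "a.txt", "a.txt")));
    }
}
```

One trade-off to keep in mind: after this de-duplication a reducer sees one value per file per mapper, not one per occurrence, so if total occurrence counts are still wanted the value would have to carry a count alongside the file name.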

Update

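To address your problem: you are trying to use reporter as if it were an instance variable, but it does not exist in the mapper's class scope; it is only a parameter of the map method. Amend the code as follows: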

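    // Imports needed at the top of WordCount.java (old "mapred" API)
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Assumes TextInputFormat, where the input key is the line's byte offset
    // (LongWritable), so the key type is LongWritable rather than Text.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        // These are the original lines; I am not using them but left them here...
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // My changes are here too
        private Text outvalue = new Text();
        private String filename = null;

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            // reporter is only available inside map(), so resolve the file name here, once
            if (filename == null) {
                filename = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            }

            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());

                // Emit (word, source file name) instead of (word, 1)
                outvalue.set(filename);
                output.collect(word, outvalue);
            }
        }
    }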

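Thank you, Chris... could you please tell me how to do that? I added the following lines to the wordcount Map class: FileSplit fileSplit = (FileSplit) reporter.getInputSplit(); private String filename = fileSplit.getPath().getName(); and, in the while loop, output.collect(word, filename). Is what I have done so far correct, as a first step towards getting the file of the current word? I am currently using Hadoop 0.20.2.

Sounds good, give it a try (FYI, even though you are on 0.20.2, you are still using the old API).

I tried to run it after changing IntWritable to Text, etc. I get the following error: WordCount.java:21: cannot find symbol: variable reporter, location: class org.myorg.WordCount.Map, at FileSplit fileSplit = (FileSplit) reporter.getInputSplit(); Any ideas?

Can you paste your amended map code back into your original question?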