Java WordCount example with per-file counts

java apache hadoop mapreduce

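I am having trouble getting a per-file breakdown of the total number of occurrences of each word. For example, I have four text files (t1, t2, t3, t4). The word w1 appears twice in file t2 and once in t4, for a total of three occurrences. I want to write exactly this information to the output file. I am getting the total number of words in each file, but I cannot get the result described above.

Here is my Map class: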

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
//line added
import org.apache.hadoop.mapreduce.lib.input.*;

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String pattern = "^[a-z][a-z0-9]*$";

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        //line added
        InputSplit inputSplit = context.getInputSplit();
        String fileName = ((FileSplit) inputSplit).getPath().getName();

        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            String stringWord = word.toString().toLowerCase();
            if ((stringWord).matches(pattern)) {
                //context.write(new Text(stringWord), one);
                context.write(new Text(stringWord), one);
                context.write(new Text(fileName), one);
                //System.out.println(fileName);
            }
        }
    }
}

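This can be achieved by emitting the word as the key and the file name as the value. Then, in your reducer, initialize a separate counter for each file and update them as you iterate. After iterating over all the values for a particular key, write each file's counter to the context.

Here you know you have only four files, so you can hard-code four variables. Remember that you need to reset the variables for every new key processed in the reducer.

If the number of files is larger, you can use a Map instead: the file name is the map key, and you keep updating its value.

In the mapper output, we can set the text file name as the key and each line of the file as the value. This reducer then gives you the file name, the words, and their corresponding counts: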
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The key is a file name. A fresh map per key keeps counts
        // from one file from leaking into the next.
        HashMap<String, Integer> input = new HashMap<String, Integer>();

        for (Text val : values) {
            String line = val.toString();        // processing each row
            String[] wordarray = line.split(" "); // assuming the delimiter is a space
            for (int i = 0; i < wordarray.length; i++) {
                if (input.get(wordarray[i]) == null) {
                    input.put(wordarray[i], 1);
                } else {
                    int value = input.get(wordarray[i]) + 1;
                    input.put(wordarray[i], value);
                }
            }
        }

        // One output record per file: file name -> {word=count, ...}
        context.write(key, new Text(input.toString()));
    }
}
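
The first suggestion above (word as key, file name as value, per-file counters that start fresh for every key) can be sketched without Hadoop on the classpath. This is only an illustration of the per-key logic, not a drop-in reducer: the class name PerFileBreakdown, the method summarize, and the output format are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: mirrors what reduce() would do for ONE word.
final class PerFileBreakdown {

    // fileNames: the values received for a single word, one entry per occurrence,
    // assuming the mapper emitted (word, fileName) pairs.
    static String summarize(Iterable<String> fileNames) {
        // A new map on every call plays the role of "reset the counters for each new key".
        Map<String, Integer> perFile = new TreeMap<>();
        int total = 0;
        for (String file : fileNames) {
            perFile.merge(file, 1, Integer::sum); // increment this file's counter
            total++;
        }

        // Assumed output format: "<total> (<file> x <count>, ...)"
        StringBuilder sb = new StringBuilder();
        sb.append(total).append(" (");
        boolean first = true;
        for (Map.Entry<String, Integer> e : perFile.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(e.getKey()).append(" x ").append(e.getValue());
            first = false;
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // w1 occurs twice in t2 and once in t4, as in the question.
        List<String> valuesForW1 = Arrays.asList("t2", "t4", "t2");
        System.out.println("w1\t" + summarize(valuesForW1)); // w1	3 (t2 x 2, t4 x 1)
    }
}
```

In an actual job, the same loop would presumably sit inside reduce(Text key, Iterable<Text> values, Context context), with each Text value converted via toString() and the summary written back with context.write. A TreeMap is used here only so the file names come out in a stable order.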

If you want a separate result for each file, run the job four times. If you want the results combined, supply all the files as input; you will need to use multiple inputs.

The first part of the result is fine (i.e., the total number of occurrences of every word across all files). But I want the breakdown by file name. For example, w1: 3 times (t2 x 2, t1 x 1).

Hi, thanks. In the Map class, I cannot pass the file name as the value. It says the method write(Text, String) is not applicable for the arguments (Text, IntWritable).

Change that line of the mapper class: public class Map extends Mapper<LongWritable, Text, Text, Text> { ... and in the write call use context.write(new Text(fileName), new Text(line)) ...
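
The type error discussed in the comments comes down to the mapper's declared output value type not matching what is written. As a Hadoop-free sketch of the word-keyed variant, the following shows the (word, file name) pairs a mapper might emit for one line when both key and value are text. The class WordFilePairs and the method pairsFor are hypothetical names; only the regular expression and the tokenizing loop are taken from the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the body of map(); not a Hadoop class.
final class WordFilePairs {

    // Same filter as the question's mapper: a lowercase letter, then letters or digits.
    private static final String PATTERN = "^[a-z][a-z0-9]*$";

    // Returns the (word, fileName) pairs a word-keyed mapper would write for one line.
    static List<Map.Entry<String, String>> pairsFor(String fileName, String line) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken().toLowerCase();
            if (word.matches(PATTERN)) {
                // Key and value are both strings, the analogue of Text/Text output types.
                out.add(new SimpleEntry<>(word, fileName));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "42" is dropped by the filter; "w1" is emitted once per occurrence.
        for (Map.Entry<String, String> p : pairsFor("t2", "w1 foo w1 42")) {
            System.out.println(p.getKey() + "\t" + p.getValue());
        }
    }
}
```

In the real mapper, each pair would correspond to something like context.write(new Text(stringWord), new Text(fileName)), which compiles only if the class is declared with Text as its output value type.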