MapReduce: getting output for only one key in a MapReduce program


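I am trying to write a MapReduce program that joins two text files, but I only get output for one of the keys. For example, if I have a file R.txt containing

a4 b3
a3 b4

and another file S.txt containing

b3 c3
b3 c1
b3 c2
b4 c4

I get the output

a4 c2
a4 c1
a4 c3

whereas if R.txt contains

b4 c4

and S.txt contains

a3 b4

the output is a3 and c4.

Here is my program: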

import java.io.IOException;
import java.util.*;     

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;  

public class RSJoin {

    // Mapper for S.txt: the join column is the first field.
    public static class SMap extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            context.write(new Text(words[0]), new Text("S\t" + words[1]));
        }
    }

    // Mapper for R.txt: the join column is the second field.
    public static class RMap extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            context.write(new Text(words[1]), new Text("R\t" + words[0]));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // For each R value, loop over the values again looking for S values.
            for (Text val : values) {
                String[] parts = val.toString().split("\t");
                String a = parts[0];
                if (a.equals("R")) {
                    for (Text val1 : values) {
                        String[] parts1 = val1.toString().split("\t");
                        String b = parts1[0];
                        if (b.equals("S")) {
                            context.write(new Text(parts[1]), new Text(parts1[1]));
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "ReduceJoin");
        job.setJarByClass(RSJoin.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setReducerClass(Reduce.class);
        // args[0] is the R input, args[1] is the S input, args[2] is the output directory.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, RMap.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SMap.class);

        job.setOutputFormatClass(TextOutputFormat.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        job.waitForCompletion(true);
    }
}


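Your join logic assumes that the R values come before the S values in the values list: you only look for an S once you have seen an R. The inner for loop over the values Iterable continues from where the outer loop left off, so if the S values come first, your inner loop never finds them.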
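To make that behaviour concrete, here is a minimal sketch in plain Java with no Hadoop dependency. `SinglePass` is a hypothetical stand-in for the reducer's values, written on the assumption that `iterator()` hands back the same underlying iterator every time; the order of the tagged values in each call is also made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SinglePassDemo {
    /** Hypothetical stand-in for the reducer's values: every iterator() call returns the same iterator. */
    static final class SinglePass<T> implements Iterable<T> {
        private final Iterator<T> it;
        SinglePass(List<T> items) { this.it = items.iterator(); }
        @Override public Iterator<T> iterator() { return it; }
    }

    /** Same nested-loop logic as the reducer in the question, applied to tagged strings. */
    static List<String> nestedLoopJoin(List<String> tagged) {
        Iterable<String> values = new SinglePass<>(tagged);
        List<String> out = new ArrayList<>();
        for (String val : values) {
            String[] parts = val.split("\t");
            if (parts[0].equals("R")) {
                // The inner loop resumes where the outer loop stopped; it does not restart.
                for (String val1 : values) {
                    String[] parts1 = val1.split("\t");
                    if (parts1[0].equals("S")) {
                        out.add(parts[1] + " " + parts1[1]);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // R arrives first: the S values after it are found.
        System.out.println(nestedLoopJoin(Arrays.asList("R\ta4", "S\tc3", "S\tc1"))); // [a4 c3, a4 c1]
        // S arrives first: it is consumed by the outer loop, so nothing is joined.
        System.out.println(nestedLoopJoin(Arrays.asList("S\tc4", "R\ta3")));          // []
    }
}
```

The second call matches the symptom in the question: a key whose S value happens to arrive before its R value produces no output at all.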

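If there is only one R value for many S values, you can either do a secondary sort (add "R" and "S" to the key, add a partitioner, and add a grouping comparator; this is the proper way to do it), or make a single pass over the whole values collection, holding the R value in a variable once it is found and keeping the S values in a list until the R is found (this does not really scale well).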
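The secondary-sort idea can be sketched outside Hadoop as follows. This is only a simulation under stated assumptions: `Rec` and `join` are hypothetical names, the in-memory sort stands in for the shuffle, and the key-change check stands in for a grouping comparator. In a real job the same effect would come from a composite key together with `job.setPartitionerClass` and `job.setGroupingComparatorClass`. The sketch keeps a small list of R values per key rather than a single variable, so it also covers keys with more than one R.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {
    /** Hypothetical record: join key, source tag ("R" or "S"), and the remaining column. */
    static final class Rec {
        final String key, tag, value;
        Rec(String key, String tag, String value) {
            this.key = key; this.tag = tag; this.value = value;
        }
    }

    static List<String> join(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        // Composite ordering: join key first, then tag, so every R precedes every S within a key.
        sorted.sort(Comparator.comparing((Rec r) -> r.key).thenComparing((Rec r) -> r.tag));

        List<String> out = new ArrayList<>();
        List<String> rValues = new ArrayList<>();
        String currentKey = null;
        for (Rec r : sorted) {
            if (!r.key.equals(currentKey)) { // a new join key starts a new group
                currentKey = r.key;
                rValues.clear();
            }
            if (r.tag.equals("R")) {
                rValues.add(r.value);        // only the R side is held in memory
            } else {
                for (String rv : rValues) {  // S values are streamed, never buffered
                    out.add(rv + " " + r.value);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
            new Rec("b3", "S", "c3"), new Rec("b3", "S", "c1"), new Rec("b3", "S", "c2"),
            new Rec("b4", "S", "c4"), new Rec("b3", "R", "a4"), new Rec("b4", "R", "a3"));
        System.out.println(join(input)); // [a4 c3, a4 c1, a4 c2, a3 c4]
    }
}
```

Because the ordering guarantees that R arrives first, one pass over each group is enough and the larger S side never has to be stored.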

I changed the reducer code as shown below and got the expected output:

public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // Buffer both sides first, so the values are read in a single pass.
    List<String> listR = new ArrayList<String>();
    List<String> listS = new ArrayList<String>();
    for (Text val : values) {
        String[] parts = val.toString().split("\t");
        String tag = parts[0];
        if (tag.equals("R")) {
            listR.add(parts[1]);
        } else if (tag.equals("S")) {
            listS.add(parts[1]);
        }
    }
    // Emit every (R, S) pair that shares this join key.
    for (String r : listR) {
        for (String s : listS) {
            context.write(new Text(r), new Text(s));
        }
    }
}
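As a quick check of the buffered approach on the sample files from the question, here is a plain-Java sketch. The class name `BufferedJoinDemo`, the `TreeMap` that stands in for the shuffle, and the choice to feed the S lines in first are all assumptions made for illustration; only the tagging and the two-list join mirror the code in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BufferedJoinDemo {
    static List<String> join(List<String> rLines, List<String> sLines) {
        // Stand-in for map + shuffle: group tagged values by join key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : sLines) {  // S is added first on purpose
            String[] w = line.split(" ");
            grouped.computeIfAbsent(w[0], k -> new ArrayList<>()).add("S\t" + w[1]);
        }
        for (String line : rLines) {
            String[] w = line.split(" ");
            grouped.computeIfAbsent(w[1], k -> new ArrayList<>()).add("R\t" + w[0]);
        }

        // Stand-in for reduce: buffer both sides, then emit the cross product per key.
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            List<String> listR = new ArrayList<>();
            List<String> listS = new ArrayList<>();
            for (String val : values) {
                String[] parts = val.split("\t");
                if (parts[0].equals("R")) {
                    listR.add(parts[1]);
                } else if (parts[0].equals("S")) {
                    listS.add(parts[1]);
                }
            }
            for (String r : listR) {
                for (String s : listS) {
                    out.add(r + " " + s);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> r = Arrays.asList("a4 b3", "a3 b4");
        List<String> s = Arrays.asList("b3 c3", "b3 c1", "b3 c2", "b4 c4");
        System.out.println(join(r, s)); // [a4 c3, a4 c1, a4 c2, a3 c4]
    }
}
```

Even with every S value ahead of its R value, both b3 and b4 are joined. The cost is that both lists for a key must fit in memory.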

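What is your expected output?

For the first example, the output should be a4 c2, a4 c1, a4 c3, a3 c4. The output I get is a4 c2, a4 c1, a4 c3. The join happens only on b3 and not on b4. I want the code to work like a natural join.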