Java: building a fuzzy-join index with the Hamming distance algorithm and MapReduce

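I'm new to Java and MapReduce. I'm trying to write a MapReduce program that reads a word list called "dictionary" and uses the Hamming distance algorithm to find, for each word, all words in the list at distance 1 from it. I can generate the output, but it seems very inefficient: I have to load the whole list into an ArrayList, and for every word I call the Hamming distance routine inside the map method. So I read the entire list twice and run the Hamming distance computation n*n times, where n is the number of words in the list.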

Please suggest a more efficient approach.

Here is the code. There is no reducer yet.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoin {


    public static class MyMapper extends Mapper<LongWritable,Text, Text, Text> {


        private List<String> Lst = new ArrayList<String>();


        protected void setup(Context context) throws java.io.IOException, InterruptedException{
            Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());


            for (Path p : files) {
                if (p.getName().equals("dictionary.txt")) {
                    BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                    String line = reader.readLine();
                    while(line != null) {
                        String tokens = line.toString() ;

                        Lst.add(tokens);
                        line = reader.readLine();
                    }
                    reader.close();
                }
            }
            if (Lst.isEmpty()) {
                throw new IOException("Unable to load dictionary data.");
            }
        }


        public void map(LongWritable key, Text val, Context con)
                throws IOException,InterruptedException {

              String line1 = val.toString();
              StringTokenizer itr = new StringTokenizer(line1.toLowerCase()) ;
              while (itr.hasMoreTokens()) {

                  String key1 = itr.nextToken() ;  
                  String fnlstr = HammingDist(key1) ;

                      con.write(new Text(key1), new Text(fnlstr));

              }
            }


        private String HammingDist(String ky)
          {
              String result = "" ;
              for(String x :Lst)
              {
                  char[] s1 = ky.toCharArray();
                    char[] s2 = x.toCharArray();

                    int shorter = Math.min(s1.length, s2.length);
                    int longer = Math.max(s1.length, s2.length);

                    int distance = 0;
                    for (int i=0; i<shorter; i++) {
                        if (s1[i] != s2[i]) distance++;
                    }

                    distance += longer - shorter;

                    if (distance <2)
                    {
                        result = result +","+x ;
                    }
              }
              // result starts as "" and is never null, so no null check is needed
              return result ;
          }
    }

  public static void main(String[] args) 
                  throws IOException, ClassNotFoundException, InterruptedException {

    Job job = new Job();
    job.setJarByClass(MapJoin.class);
    job.setJobName("MapJoin");
    job.setNumReduceTasks(0);

    try{
    DistributedCache.addCacheFile(new URI("/Input/dictionary.txt"), job.getConfiguration());
    }catch(Exception e){
        System.out.println(e);
    }

    job.setMapperClass(MyMapper.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);


  }
}

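For every token you find in map, you call HammingDist, which iterates over List<String> Lst and converts each item to a char[]. It would be better to replace List<String> Lst with a List<char[]> and add the elements already converted in the first place, instead of converting the same words over and over again.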
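The suggestion above could look roughly like the following. This is only a minimal sketch with the Hadoop parts left out so that it runs on its own: the class name PrecomputedHamming and the methods distance and neighbours are hypothetical names, not part of the original code. In the real job, the constructor's loop would presumably live in setup() and neighbours() would take the place of HammingDist. The length pre-filter is an extra assumption on top of the answer, and unlike the original, the result has no leading comma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: the dictionary is converted to char[] once and reused for every lookup. */
final class PrecomputedHamming {

    // Filled once; in the MapReduce job this would happen in Mapper.setup().
    private final List<char[]> dictionary = new ArrayList<>();

    PrecomputedHamming(List<String> words) {
        for (String w : words) {
            dictionary.add(w.toCharArray()); // convert each word exactly once
        }
    }

    /** Same metric as the question: position mismatches plus the length difference. */
    static int distance(char[] a, char[] b) {
        int shorter = Math.min(a.length, b.length);
        int distance = Math.abs(a.length - b.length);
        for (int i = 0; i < shorter; i++) {
            if (a[i] != b[i]) distance++;
        }
        return distance;
    }

    /** Returns all dictionary words with distance < 2 from key, comma-separated. */
    String neighbours(String key) {
        char[] k = key.toCharArray(); // the key is converted once, not once per candidate
        StringBuilder result = new StringBuilder();
        for (char[] candidate : dictionary) {
            // Cheap pre-filter: a length gap above 1 already exceeds the threshold.
            if (Math.abs(candidate.length - k.length) > 1) continue;
            if (distance(k, candidate) < 2) {
                if (result.length() > 0) result.append(',');
                result.append(candidate);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        PrecomputedHamming index =
                new PrecomputedHamming(Arrays.asList("cat", "cot", "dog", "cats"));
        System.out.println(index.neighbours("cat")); // prints "cat,cot,cats"
    }
}
```

Note that this only removes the repeated conversions and string concatenation; the number of comparisons is still n*n. As in the original code, a word matches itself (distance 0).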
Thanks janos, the time taken has dropped to half of what the original code needed.