Java: Normalizing records with MapReduce

I have this sample record:
100,1:2:3

I want to normalize it to:
100,1
100,2
100,3
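
To make the target concrete, here is a minimal plain-Java sketch of the transformation, independent of Hadoop. The class and method names (`NormalizeExample`, `normalize`) are hypothetical and only illustrate the expected input and output; this is not the MapReduce code in question.

```java
import java.util.ArrayList;
import java.util.List;

final class NormalizeExample {

    /** Expands "key,v1:v2:..." into one "key,v" row per value. */
    static List<String> normalize(String record) {
        int comma = record.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("malformed record: " + record);
        }
        String key = record.substring(0, comma);
        String values = record.substring(comma + 1);

        List<String> rows = new ArrayList<>();
        for (String v : values.split(":")) {
            rows.add(key + "," + v);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints the three normalized rows for the sample record.
        for (String row : normalize("100,1:2:3")) {
            System.out.println(row);
        }
    }
}
```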

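A colleague of mine wrote a Pig script to do this, and my MapReduce code takes longer than it does. I was previously using the default TextInputFormat, but to improve performance I decided to write a custom InputFormat class with a custom RecordReader. Taking the LineRecordReader class as a reference, I tried writing the following code: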

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

import com.normalize.util.Splitter;

public class NormalRecordReader extends RecordReader<Text, Text> {

    private long start;
    private long pos;
    private long end;
    private LineReader in;
    private int maxLineLength;
    private Text key = null;
    private Text value = null;
    private Text line = null;

    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);

        start = split.getStart();
        end = start + split.getLength();

        final Path file = split.getPath();

        FileSystem fs = file.getFileSystem(job);
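        FSDataInputStream fileIn = fs.open(split.getPath());

        // A split that does not start at offset 0 usually begins mid-line:
        // seek to the split and discard the partial first line (the reader
        // of the previous split consumes it), as LineRecordReader does.
        boolean skipFirstLine = false;
        if (start != 0) {
            skipFirstLine = true;
            --start;
            fileIn.seek(start);
        }
        in = new LineReader(fileIn, job);
        if (skipFirstLine) {
            start += in.readLine(new Text(), 0,
                    (int) Math.min((long) Integer.MAX_VALUE, end - start));
        }
        this.pos = start;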
    }

    public boolean nextKeyValue() throws IOException {
        int newSize = 0;
        if (line == null) {
            line = new Text();
        }

        while (pos < end) {
            newSize = in.readLine(line);
            if (newSize == 0) {
                break;
            }
            pos += newSize;
            if (newSize < maxLineLength) {
                break;
            }

            // line too long. try again
            System.out.println("Skipped line of size " + newSize + " at pos " + (pos - newSize));
        }
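        // Nothing was read (end of split): stop before parsing a stale or empty line.
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        }

        // Break the line at the first-level delimiter: "100,1:2:3" -> key "100", value "1:2:3".
        Splitter splitter = new Splitter(line.toString(), ",");
        List<String> split = splitter.split();

        if (key == null) {
            key = new Text();
        }
        key.set(split.get(0));

        if (value == null) {
            value = new Text();
        }
        value.set(split.get(1));

        return true;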
    }

    @Override
    public Text getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    /**
     * Get the progress within the split
     */
    public float getProgress() {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (pos - start) / (float)(end - start));
        }
    }

    public synchronized void close() throws IOException {
        if (in != null) {
            in.close(); 
        }
    }
}

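Although this works, I have not seen any performance improvement. Here I break the record at "," and set 100 as the key and 1:2:3 as the value. I only call the mapper, which does the following: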
public void map(Text key, Text value, Context context) 
        throws IOException, InterruptedException {

    try {
        Splitter splitter = new Splitter(value.toString(), ":");
        List<String> splits = splitter.split();

        for (String split : splits) {
            context.write(key, new Text(split));
        }

    } catch (IndexOutOfBoundsException ibe) {
        System.err.println(value + " is malformed.");
    }
}

The Splitter class is used to split the data, because I found String's own split to be slower. The method is:
public List<String> split() {

    List<String> splitData = new ArrayList<String>();
    int beginIndex = 0, endIndex = 0;

    while(true) {

        endIndex = dataToSplit.indexOf(delim, beginIndex);
        if(endIndex == -1) {
            splitData.add(dataToSplit.substring(beginIndex));
            break;
        }

        splitData.add(dataToSplit.substring(beginIndex, endIndex));
        beginIndex = endIndex + delimLength;
    }

    return splitData;
}
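
The rest of the Splitter class is not shown in the question. For reference, a minimal sketch of how it might look, assuming the fields `dataToSplit`, `delim` and `delimLength` are set from the two constructor arguments (the field names come from the method above; the constructor and the empty-delimiter guard are assumptions):

```java
import java.util.ArrayList;
import java.util.List;

final class Splitter {
    private final String dataToSplit;
    private final String delim;
    private final int delimLength;

    Splitter(String dataToSplit, String delim) {
        // Assumption: an empty delimiter is rejected, since indexOf("", i)
        // always returns i and the loop below would never advance.
        if (delim.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        this.dataToSplit = dataToSplit;
        this.delim = delim;
        this.delimLength = delim.length();
    }

    /** Splits on the literal delimiter (no regex), keeping empty tokens. */
    List<String> split() {
        List<String> splitData = new ArrayList<String>();
        int beginIndex = 0;
        while (true) {
            int endIndex = dataToSplit.indexOf(delim, beginIndex);
            if (endIndex == -1) {
                splitData.add(dataToSplit.substring(beginIndex));
                break;
            }
            splitData.add(dataToSplit.substring(beginIndex, endIndex));
            beginIndex = endIndex + delimLength;
        }
        return splitData;
    }

    public static void main(String[] args) {
        System.out.println(new Splitter("100,1:2:3", ",").split());
        System.out.println(new Splitter("1:2:3", ":").split());
    }
}
```

Unlike `String.split`, this version treats the delimiter as a literal string rather than a regular expression and keeps trailing empty tokens.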

Can the code be improved in any way?

Let me summarize here, rather than in the comments, what I think you can improve:

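As noted earlier, you are currently creating a Text object several times per record (as many times as the record has tokens). This may not matter much for small inputs, but it can be a big deal for decent-sized jobs. To fix it, do the following: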
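// Reused for every token: context.write() serializes the value right away,
// so the same instance can safely be overwritten on the next iteration.
private final Text text = new Text();

public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    // .... (build splits as before)
    for (String split : splits) {
        text.set(split);
        context.write(key, text);
    }
}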


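For the splitting, what you are doing right now is, for every record, allocating a new list, populating it, and then iterating over it to write your output. You do not actually need a list here, since you are not maintaining any state. Using the implementation of the split method you provided, you only need to make one pass over your data: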
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    String dataToSplit = value.toString();
    String delim = ":";

    int beginIndex = 0;
    int endIndex = 0;

    while (true) {
        endIndex = dataToSplit.indexOf(delim, beginIndex);
        if (endIndex == -1) {
            // No more delimiters: emit the last (or only) token and stop.
            text.set(dataToSplit.substring(beginIndex));
            context.write(key, text);
            break;
        }

        // Emit the token before the delimiter, reusing the same Text instance.
        text.set(dataToSplit.substring(beginIndex, endIndex));
        context.write(key, text);
        beginIndex = endIndex + delim.length();
    }
}
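
The same single-pass idea can be tried outside Hadoop. The sketch below is only an illustration: `SinglePassSplit` and `emitTokens` are hypothetical names, and a `BiConsumer` stands in for `context.write` so the loop can be run and tested without a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

final class SinglePassSplit {

    /**
     * Calls sink.accept(key, token) once per delimiter-separated token,
     * without building an intermediate list of tokens.
     */
    static void emitTokens(String key, String data, String delim,
                           BiConsumer<String, String> sink) {
        if (delim.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        int begin = 0;
        int end;
        while ((end = data.indexOf(delim, begin)) != -1) {
            sink.accept(key, data.substring(begin, end));
            begin = end + delim.length();
        }
        // Last (or only) token after the final delimiter.
        sink.accept(key, data.substring(begin));
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        emitTokens("100", "1:2:3", ":", (k, v) -> out.add(k + "," + v));
        for (String row : out) {
            System.out.println(row);
        }
    }
}
```

Note that `substring` still allocates one String per token; what this removes is the per-record list and the second iteration over it.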


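I do not really see why you wrote your own InputFormat; it seems that KeyValueTextInputFormat is exactly what you need, and it has probably already been optimized. Here is how you would use it: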
conf.set("key.value.separator.in.input.line", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);
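
For context, here is a sketch of how those two lines might sit in a complete map-only driver. It assumes Hadoop 2.x and the `org.apache.hadoop.mapreduce` API, where the separator property is, to my knowledge, named `mapreduce.input.keyvaluelinerecordreader.key.value.separator` (the older name used above is treated as a deprecated alias); check this against your version. `NormalizeDriver` and `NormalizeMapper` are hypothetical names, and the input and output paths are taken from the command line.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NormalizeDriver {

    /** Emits one (key, token) pair per ':'-separated token of the value. */
    public static class NormalizeMapper extends Mapper<Text, Text, Text, Text> {
        private final Text out = new Text(); // reused for every token

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            String data = value.toString();
            int begin = 0;
            int end;
            while ((end = data.indexOf(':', begin)) != -1) {
                out.set(data.substring(begin, end));
                context.write(key, out);
                begin = end + 1;
            }
            out.set(data.substring(begin));
            context.write(key, out);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: Hadoop 2.x property name; older releases use
        // "key.value.separator.in.input.line" as shown above.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "normalize");
        job.setJarByClass(NormalizeDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(NormalizeMapper.class);
        job.setNumReduceTasks(0); // map-only: no shuffle or reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this input format the mapper receives everything before the first separator as the key and the rest of the line as the value, which is what the custom RecordReader was doing by hand.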


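Based on your example, the key of each record appears to be an integer. If that is always the case, then using Text as the mapper input key is not optimal; it should be an IntWritable, or maybe even a ByteWritable, depending on what is in your input.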