Hadoop: Overriding RecordReader to read a paragraph at once instead of a line

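I am overriding the `next` method of the RecordReader class and the `getRecordReader` method of the TextInputFormat class so that a whole paragraph is sent to the mapper instead of one line at a time. (I am using the old API, and my definition of a paragraph is: lines are appended until an empty line appears in the text file.)
Here is my code: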

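import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class NLinesInputFormat extends TextInputFormat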
{  
   @Override
   public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter)throws IOException     {   
        reporter.setStatus(split.toString());  
        return new ParagraphRecordReader(conf, (FileSplit)split);
    }
}



public class ParagraphRecordReader implements RecordReader<LongWritable, Text> 
{
        private LineRecordReader lineRecord;
        private LongWritable lineKey;
        private Text lineValue;
        public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
            lineRecord = new LineRecordReader(conf, split);
            lineKey = lineRecord.createKey();
            lineValue = lineRecord.createValue();
        }

        @Override
        public void close() throws IOException {
            lineRecord.close();
        }

        @Override
        public LongWritable createKey() {
            return new LongWritable();

        }

        @Override
        public Text createValue() {
            return new Text("");

        }

        @Override
        public float getProgress() throws IOException {
            // Delegate to the wrapped reader: getPos() is a byte offset, not a 0..1 fraction.
            return lineRecord.getProgress();
        }

        @Override
        public synchronized boolean next(LongWritable key, Text value) throws IOException {
            boolean appended, gotsomething;
            boolean retval;
            byte space[] = {' '};
            value.clear(); // start every paragraph from an empty value
            gotsomething = false;
            do {
                appended = false;
                // Read the next line from the wrapped line reader.
                retval = lineRecord.next(lineKey, lineValue);
                if (retval) {
                    if (lineValue.toString().length() > 0) {
                        // Non-empty line: append it to the paragraph, followed by a space.
                        byte[] rawline = lineValue.getBytes();
                        int rawlinelen = lineValue.getLength();
                        value.append(rawline, 0, rawlinelen);
                        value.append(space, 0, 1);
                        appended = true;
                    }
                    gotsomething = true;
                }
            } while (appended); // stop at an empty line or when no more lines are available

            //System.out.println("ParagraphRecordReader::next() returns "+gotsomething+" after setting value to: ["+value.toString()+"]");
            return gotsomething;
        }

        @Override
        public long getPos() throws IOException {
            return lineRecord.getPos();
        }
    }

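Questions:
1. I could not find any specific guide for this, so I may be doing something wrong. Any suggestions?

2. The code compiles fine, but when I run my job the mapper keeps running forever, and I cannot figure out where the problem is.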

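Your code works perfectly fine for me. The only change I made was to turn these classes into inner classes and make them static.

The input file looked like this: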

This is awesome.
WTF is this.

This is just a test.

The mapper code looked like this:

@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {

    System.out.println(key+" : "+value);
}

I am sure you did not forget to set the input format, but just in case, set it as follows:

conf.setInputFormat(NLinesInputFormat.class);

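Have you tried an input with just a single paragraph? I think you have a bug: when crossing splits, you will get extra paragraphs. I think you need to distinguish between the split that starts at 0 and every other split. The first line of the split starting at 0 begins a paragraph, but a split that starts in the middle of a paragraph should not begin a new one. (Normally you read past the split boundary, so if your split contains lines that continue a paragraph, they will already have been emitted by the previous split.) Am I missing something?

Thanks for the reply! I am using these classes as public static classes and setting the input format, but I had not tried a small paragraph; I was testing with a large file. I will do that and let you know how it goes.

Hey, thanks man. I checked with a short input file, and it works fine for long files. It was a formatting problem, which I have resolved.

@Amar I am a beginner with Hadoop. Can you explain what is happening in the next method? Can you explain the logic of the implementation? I need some help.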

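next() basically decides what the next record for the mapper is. The default implementation emits one line. In this case we need to pass a complete paragraph to the mapper as a single record, so we override next(). A paragraph is defined as the collection of all lines up to two consecutive newline characters ('\n'). This is implemented on top of LineRecordReader: we keep accumulating lines until we get an empty line.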
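
To make the accumulate-until-empty-line logic concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name ParagraphSketch and the helpers nextParagraph and readAll are hypothetical names made up for this illustration; they are not part of Hadoop or of the code in the question. The sketch assumes that a paragraph ends at the first empty line and, like the next() in the question, it appends a space after every line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ParagraphSketch {

    /**
     * Hypothetical helper: reads lines until an empty line or the end of input
     * and joins them, adding a space after each line.
     * Returns null once the input is exhausted.
     */
    static String nextParagraph(BufferedReader in) throws IOException {
        StringBuilder value = new StringBuilder();
        boolean gotSomething = false;
        String line;
        while ((line = in.readLine()) != null) {
            gotSomething = true;
            if (line.isEmpty()) {
                break; // empty line: the current paragraph is complete
            }
            value.append(line).append(' ');
        }
        return gotSomething ? value.toString() : null;
    }

    /** Hypothetical helper: collects every paragraph found in the given text. */
    static List<String> readAll(String text) {
        List<String> paragraphs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String paragraph;
            while ((paragraph = nextParagraph(in)) != null) {
                paragraphs.add(paragraph);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String input = "This is awesome.\nWTF is this.\n\nThis is just a test.\n";
        for (String paragraph : readAll(input)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

On the sample input from the answer, main prints `[This is awesome. WTF is this. ]` and `[This is just a test. ]`. As in the question's code, two consecutive empty lines produce an empty paragraph, and the sketch does not deal with split boundaries at all.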