Combining multiple input formats into a single input format in Hadoop (Java)

java xml hadoop

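I am facing the following situation and would appreciate some help. I am using Hadoop MapReduce to process XML files.

By following this site, I can read through my records successfully. However, when the XML file is larger than the HDFS block size, I do not get the correct values, so I need to read the whole file as a single unit. For that I found this link.

The problem now is how to combine the two InputFormats into a single InputFormat.

Please help me as soon as possible. Thanks.

Update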

public class XmlParser11
{

    public static class XmlInputFormat1 extends TextInputFormat {

        public static final String START_TAG_KEY = "xmlinput.start";
        public static final String END_TAG_KEY = "xmlinput.end";

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
            return new XmlRecordReader();
        }

        /**
         * XMLRecordReader class to read through a given xml document to output
         * xml blocks as records as specified by the start tag and end tag
         */
        public static class XmlRecordReader extends RecordReader<LongWritable, Text> {
            private byte[] startTag;
            private byte[] endTag;
            private long start;
            private long end;
            private FSDataInputStream fsin;
            private DataOutputBuffer buffer = new DataOutputBuffer();

            private LongWritable key = new LongWritable();
            private Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                Configuration conf = context.getConfiguration();
                startTag = conf.get(START_TAG_KEY).getBytes("utf-8");
                endTag = conf.get(END_TAG_KEY).getBytes("utf-8");
                FileSplit fileSplit = (FileSplit) split;

But it does not work.

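Override isSplitable to return false so that the file is never split, even when it is larger than the block size. This is commonly done to make sure a large file is processed by a single mapper.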

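public class XmlInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each file goes to a single mapper, whatever its size
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // Return your version of the XML record reader, for example the
        // XmlRecordReader from the question
        return new XmlRecordReader();
    }
}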
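
To see why refusing to split can matter, here is a small plain-Java illustration. It is only a sketch under a stated assumption: the reader treats every split as an independent chunk and never reads past its end. Whether a particular XML record reader behaves this way depends on its implementation. SplitBoundaryDemo and countRecords are hypothetical names, and no Hadoop classes are involved.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hadoop.
class SplitBoundaryDemo {

    // Counts the startTag..endTag blocks that lie entirely inside data[from, to).
    // Assumption: the chunk is scanned in isolation, as a naive per-split reader would.
    static int countRecords(byte[] data, int from, int to, String startTag, String endTag) {
        String chunk = new String(data, from, to - from, StandardCharsets.UTF_8);
        int count = 0;
        int pos = 0;
        while (true) {
            int s = chunk.indexOf(startTag, pos);
            if (s < 0) break;                       // no further record starts here
            int e = chunk.indexOf(endTag, s + startTag.length());
            if (e < 0) break;                       // record is cut off by the chunk end
            count++;
            pos = e + endTag.length();
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "<rec>1</rec><rec>2</rec><rec>3</rec>".getBytes(StandardCharsets.UTF_8);
        int cut = 15; // falls inside the second record

        int whole = countRecords(data, 0, data.length, "<rec>", "</rec>");
        int split = countRecords(data, 0, cut, "<rec>", "</rec>")
                  + countRecords(data, cut, data.length, "<rec>", "</rec>");

        System.out.println("unsplit: " + whole); // unsplit: 3
        System.out.println("split:   " + split); // split:   2 (the record across the cut is lost)
    }
}
```

With isSplitable returning false there is only one chunk per file, so a record can never straddle a boundary.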

But then the RecordReader has to be written correctly. I already have a RecordReader for XML, so how do I merge the whole-file reader into it?
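
One way to look at it: when isSplitable returns false, FileInputFormat produces a single split that covers the whole file, so the tag-scanning loop of an XML record reader can simply run from the first byte to the last and no separate whole-file reader is needed. Below is a minimal plain-Java sketch of such a scanning loop. It uses a java.io.InputStream in place of FSDataInputStream so that it runs without Hadoop; XmlBlockScanner, readUntilMatch and extractRecords are hypothetical names chosen for illustration, not an existing API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: extracts startTag..endTag blocks
// from a stream that is read from beginning to end.
class XmlBlockScanner {

    // Reads bytes until `match` has been seen in full. If `out` is non-null,
    // every byte read is also copied into it. Returns false at end of stream.
    // The matching is deliberately simple (no full backtracking).
    static boolean readUntilMatch(InputStream in, byte[] match, ByteArrayOutputStream out)
            throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) return false;
            if (out != null) out.write(b);
            if (b == match[i]) {
                i++;
                if (i >= match.length) return true;
            } else {
                // Restart the match, keeping this byte if it begins the tag again
                i = (b == match[0]) ? 1 : 0;
            }
        }
    }

    // Returns every block delimited by startTag and endTag, tags included.
    static List<String> extractRecords(InputStream in, String startTag, String endTag)
            throws IOException {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        while (readUntilMatch(in, start, null)) {       // skip ahead to the next start tag
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            buf.write(start, 0, start.length);
            if (!readUntilMatch(in, end, buf)) break;   // unterminated record: stop
            records.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<root><rec>a</rec><rec>b</rec></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String record : extractRecords(in, "<rec>", "</rec>")) {
            System.out.println(record);
        }
    }
}
```

In a real RecordReader the same loop would presumably sit in nextKeyValue(), emitting one block per call, with the start and end tags taken from the xmlinput.start and xmlinput.end configuration keys shown in the question.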

// Set the maximum split size
setMaxSplitSize(MAX_INPUT_SPLIT_SIZE);