Hadoop - decompressing zipped files

I have a lot of compressed files in zip format (GBs) and I want to write a map-only job to decompress them. My mapper class looks like this:

import java.util.zip.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.OutputCollector;
import java.io.*;

public class DecompressMapper extends Mapper <LongWritable, Text, LongWritable, Text>
{
    private static final int BUFFER_SIZE = 4096;

    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Context context) throws IOException
    {
        FileSplit fileSplit = (FileSplit)context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        this.unzip(fileName, new File(fileName).getParent()  + File.separator +  "/test_poc");  
    }

    public void unzip(String zipFilePath, String destDirectory) throws IOException {
        File destDir = new File(destDirectory);
        if (!destDir.exists()) {
            destDir.mkdir();
        }
        ZipInputStream zipIn = new ZipInputStream(new FileInputStream(zipFilePath));
        ZipEntry entry = zipIn.getNextEntry();
        // iterates over entries in the zip file
        while (entry != null) {
            String filePath = destDirectory + File.separator + entry.getName();
            if (!entry.isDirectory()) {
                // if the entry is a file, extracts it
                extractFile(zipIn, filePath);
            } else {
                // if the entry is a directory, make the directory
                File dir = new File(filePath);
                dir.mkdir();
            }
            zipIn.closeEntry();
            entry = zipIn.getNextEntry();
        }
        zipIn.close();
    }

    private void extractFile(ZipInputStream zipIn, String filePath) throws IOException {
        BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filePath));
        byte[] bytesIn = new byte[BUFFER_SIZE];
        int read = 0;
        while ((read = zipIn.read(bytesIn)) != -1) {
            bos.write(bytesIn, 0, read);
        }
        bos.close();
    }
}

My mapper class does not seem to work properly. I am not getting the decompressed files in the desired directory. Any help is appreciated. Thanks…

There are a few problems with the above code:

  • You are mixing the MR1 API and the MR2 API. Never do that.
  • You are using plain Java IO. Hadoop's own file system (HDFS) cannot be reached through java.io file operations.
  • The output path is not constructed correctly.

    We need to be careful when writing MapReduce programs: Hadoop uses a completely different file system (HDFS), we have to keep that in mind while writing code, and we should never mix the MR1 and MR2 APIs.
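    As a minimal sketch (not part of the answer's code), reading a ZIP from HDFS and writing a copy back through org.apache.hadoop.fs.FileSystem, rather than through java.io, looks roughly like this; the class name and paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsIoSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // FileSystem resolves paths against HDFS (or whatever fs.defaultFS points to),
            // which a plain java.io.FileInputStream cannot do.
            FileSystem fs = FileSystem.get(conf);

            Path src = new Path("/user/hadoop/input/sample.zip");    // hypothetical HDFS path
            Path dst = new Path("/user/hadoop/test_poc/sample.zip"); // hypothetical HDFS path

            try (FSDataInputStream in = fs.open(src);
                 FSDataOutputStream out = fs.create(dst, true)) {
                // Copy the bytes between the two HDFS locations through the Hadoop API.
                IOUtils.copyBytes(in, out, conf, false);
            }
        }
    }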

    There is no built-in way to decompress zip files on the Hadoop file system, but after a long search I figured out how to decompress them directly on HDFS. The prerequisite is that you copy the zip files to some location and then run a MapReduce job. Since Hadoop does not understand a zip file as an input format, we need to customize the Mapper and the Reducer so that we can control what the mapper emits and what the reducer consumes. Note that this MapReduce job will run on a single mapper, because when customizing the record reader class provided by Hadoop we disable splitting, i.e. make isSplitable() return false. The job therefore takes the file name as the key and the contents of the uncompressed file as the value. When the reducer consumes it, I set the output key to null so that only the uncompressed contents are kept, and the number of reducers is set to 1 so that everything is dumped into a single part file.

    We all know that Hadoop cannot handle zip files on its own, but Java can with the help of its ZipFile class: zip file contents can be read through ZipInputStream and zip entries through ZipEntry. So we write a customized ZipFileInputFormat class that extends FileInputFormat.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    
    public class ZipFileInputFormat extends FileInputFormat<Text, BytesWritable> {
    /** See the comments on the setLenient() method */
    private static boolean isLenient = false;
    
    /**
     * ZIP files are not splittable, so this always returns false and each
     * ZIP file is processed by a single mapper.
     */
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
    
    /**
     * Create the ZipFileRecordReader to parse the file
     */
    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException,
            InterruptedException {
        return new ZipFileRecordReader();
    }
    
    /**
     * In lenient mode, a ZipException/EOFException while reading an entry does
     * not fail the job; the record reader simply stops processing that file
     * (see ZipFileRecordReader.nextKeyValue()).
     *
     * @param lenient whether to tolerate corrupt ZIP entries
     */
    public static void setLenient(boolean lenient) {
        isLenient = lenient;
    }
    
    public static boolean getLenient() {
        return isLenient;
    }
    }
    
    Note that in the job class we configure the InputFormatClass to the ZipFileInputFormat class and the OutputFormatClass to the TextOutputFormat class.
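    A minimal sketch of that wiring in an MR2 driver (the full MyJob driver used in this answer appears further down); the class name here is made up, and TextOutputFormat is the mapreduce-API class, not the old mapred one:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class DriverSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "unzip");
            job.setJarByClass(DriverSketch.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);

            // Custom input format: one non-split mapper per ZIP file,
            // emitting (entry name, uncompressed bytes) pairs.
            job.setInputFormatClass(ZipFileInputFormat.class);
            // Standard text output for the reducer's records.
            job.setOutputFormatClass(TextOutputFormat.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            ZipFileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setNumReduceTasks(1);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }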

    Mavenize the project and keep the dependencies as they are to run the code. Export the jar file and deploy it on the Hadoop cluster. Tested and deployed on CDH 5.5 (YARN). The contents of the POM file are as follows:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    
    <groupId>com.mithun</groupId>
    <artifactId>CustomisedMapperReducer</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    
    <name>CustomisedMapperReducer</name>
    <url>http://maven.apache.org</url>
    
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0</version>
        </dependency>
    
        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-mapper-asl</artifactId>
            <version>1.9.13</version>
        </dependency>
    
    
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    </project>
    
    
    Comment: In which directory are you getting the output?
    Reply: I am not getting the output in the desired directory. The code above does not decompress the zipped files; I only get unwanted output in the output folder given while running the map reduce job. For other solutions, see

    The ZipFileRecordReader returned by the input format above does the actual work: it reads each ZIP entry from HDFS and hands the mapper the entry name as the key and the uncompressed bytes as the value.
    import java.io.ByteArrayOutputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipException;
    import java.util.zip.ZipInputStream;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    
    public class ZipFileRecordReader extends RecordReader<Text, BytesWritable> {
    /** InputStream used to read the ZIP file from the FileSystem */
    private FSDataInputStream fsin;
    
    /** ZIP file parser/decompressor */
    private ZipInputStream zip;
    
    /** Uncompressed file name */
    private Text currentKey;
    
    /** Uncompressed file contents */
    private BytesWritable currentValue;
    
    /** Used to indicate progress */
    private boolean isFinished = false;
    
    /**
     * Initialise and open the ZIP file from the FileSystem
     */
    @Override
    public void initialize(InputSplit inputSplit,
            TaskAttemptContext taskAttemptContext) throws IOException,
            InterruptedException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = taskAttemptContext.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
    
        // Open the stream
        fsin = fs.open(path);
        zip = new ZipInputStream(fsin);
    }
    
    /**
     * Each ZipEntry is decompressed and readied for the Mapper. The contents of
     * each file are held *in memory* in a BytesWritable object.
     * 
     * If the ZipFileInputFormat has been set to Lenient (not the default),
     * certain exceptions will be gracefully ignored to prevent a larger job
     * from failing.
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        ZipEntry entry = null;
        try {
            entry = zip.getNextEntry();
        } catch (ZipException e) {
            if (ZipFileInputFormat.getLenient() == false)
                throw e;
        }
    
        // Sanity check
        if (entry == null) {
            isFinished = true;
            return false;
        }
    
        // Filename
        currentKey = new Text(entry.getName());
    
        if (currentKey.toString().endsWith(".zip")) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] temp1 = new byte[8192];
            while (true) {
                int bytesread1 = 0;
                try {
                    bytesread1 = zip.read(temp1, 0, 8192);
                } catch (EOFException e) {
                    if (ZipFileInputFormat.getLenient() == false)
                        throw e;
                    return false;
                }
                if (bytesread1 > 0)
                    bos.write(temp1, 0, bytesread1);
                else
                    break;
            }
    
            zip.closeEntry();
            currentValue = new BytesWritable(bos.toByteArray());
            return true;
    
        }
    
        // Read the file contents
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] temp = new byte[8192];
        while (true) {
            int bytesRead = 0;
            try {
                bytesRead = zip.read(temp, 0, 8192);
            } catch (EOFException e) {
                if (ZipFileInputFormat.getLenient() == false)
                    throw e;
                return false;
            }
            if (bytesRead > 0)
                bos.write(temp, 0, bytesRead);
            else
                break;
        }
        zip.closeEntry();
    
        // Uncompressed contents
        currentValue = new BytesWritable(bos.toByteArray());
        return true;
    }
    
    /**
     * Rather than calculating progress, we just keep it simple
     */
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return isFinished ? 1 : 0;
    }
    
    /**
     * Returns the current key (name of the zipped file)
     */
    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return currentKey;
    }
    
    /**
     * Returns the current value (contents of the zipped file)
     */
    @Override
    public BytesWritable getCurrentValue() throws IOException,
            InterruptedException {
        return currentValue;
    }
    
    /**
     * Close quietly, ignoring any exceptions
     */
    @Override
    public void close() throws IOException {
        try {
            zip.close();
        } catch (Exception ignore) {
        }
        try {
            fsin.close();
        } catch (Exception ignore) {
        }
    }
    }
    
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    // Receives each ZIP entry from ZipFileRecordReader as (entry name, uncompressed bytes)
    // and emits the contents of .txt entries.
    public class MyMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    
    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
    
        String filename = key.toString();
    
        // We only want to process .txt files
        if (filename.endsWith(".txt") == false)
            return;
    
        // Prepare the content; use getLength() because getBytes() returns the
        // full backing buffer, which may be longer than the actual data.
        String content = new String(value.getBytes(), 0, value.getLength(), "UTF-8");
    
        context.write(new Text(content), one);
    }
    }
    
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        // Emit only the key (the uncompressed content); the value is left null so
        // that nothing but the content ends up in the single part file.
        context.write(new Text(key), null);
    }
    }
    
    import java.io.IOException;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import com.saama.CustomisedMapperReducer.MyMapper;
    import com.saama.CustomisedMapperReducer.MyReducer;
    import com.saama.CustomisedMapperReducer.ZipFileInputFormat;
    import com.saama.CustomisedMapperReducer.ZipFileRecordReader;
    
    public class MyJob {
    
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
    
        Job job = new Job(conf);
        job.setJarByClass(MyJob.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
    
        job.setInputFormatClass(ZipFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
    
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    
        ZipFileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(1);
    
        job.waitForCompletion(true);
    
    }
    }
    