Running Hadoop with compressed files as input: input records read out of order, NumberFormatException

After modifying the properties in mapred-site.xml, I supplied a tar.bz2 file, a .gz file, and a tar.gz file as input. None of these seems to work. My assumption is that the records Hadoop reads as input are out of order: the input has one column of strings and another of integers, but when reading from the compressed file, Hadoop at some point reads part of a string as an integer and throws a NumberFormatException. I may simply be goofing up somewhere; I would like to know whether the problem is in my configuration or in my code.

The properties in core-site.xml are:

<property>
  <name>io.compression.codecs</name>
   <value>org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
   <description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>

<property>
   <name>mapred.map.output.compression.codec</name>
   <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
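
The job log below warns that several of these mapred.* names are deprecated. As a reference, here is a minimal sketch of setting the Hadoop 2.x replacement keys programmatically on the Configuration (property names taken from the deprecation warnings in the log; this is an illustration, not the job's actual code):

public class CompressionConf {
    // Hadoop 2.x equivalents of the deprecated mapred.* keys above.
    public static org.apache.hadoop.conf.Configuration create() {
        org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration();
        conf.setBoolean("mapreduce.map.output.compress", true);            // was mapred.compress.map.output
        conf.setClass("mapreduce.map.output.compress.codec",               // was mapred.map.output.compression.codec
                org.apache.hadoop.io.compress.BZip2Codec.class,
                org.apache.hadoop.io.compress.CompressionCodec.class);
        conf.set("mapreduce.output.fileoutputformat.compress.type",        // was mapred.output.compression.type
                "BLOCK");
        return conf;
    }
}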
Command:

hadoop jar mysort.jar org.myorg.MySort MySort/input/ MySort/output
Here is the output:

Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/06/25 11:20:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/06/25 11:20:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/06/25 11:20:29 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/06/25 11:20:29 INFO input.FileInputFormat: Total input paths to process : 1
14/06/25 11:20:29 INFO mapreduce.JobSubmitter: number of splits:1
14/06/25 11:20:29 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
14/06/25 11:20:29 INFO Configuration.deprecation: mapred.map.output.compression.codec is deprecated. Instead, use mapreduce.map.output.compress.codec
14/06/25 11:20:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1403675322820_0001
14/06/25 11:20:30 INFO impl.YarnClientImpl: Submitted application application_1403675322820_0001
14/06/25 11:20:30 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1403675322820_0001/
14/06/25 11:20:30 INFO mapreduce.Job: Running job: job_1403675322820_0001
14/06/25 11:20:52 INFO mapreduce.Job: Job job_1403675322820_0001 running in uber mode : false
14/06/25 11:20:52 INFO mapreduce.Job:  map 0% reduce 0%
14/06/25 11:21:10 INFO mapreduce.Job: Task Id : attempt_1403675322820_0001_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "0ustar"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.myorg.MySort$Map.map(MySort.java:36)
    at org.myorg.MySort$Map.map(MySort.java:23)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

14/06/25 11:21:29 INFO mapreduce.Job: Task Id : attempt_1403675322820_0001_m_000000_1, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "0ustar"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.myorg.MySort$Map.map(MySort.java:36)
    at org.myorg.MySort$Map.map(MySort.java:23)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

14/06/25 11:21:49 INFO mapreduce.Job: Task Id : attempt_1403675322820_0001_m_000000_2, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "0ustar"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.myorg.MySort$Map.map(MySort.java:36)
    at org.myorg.MySort$Map.map(MySort.java:23)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

14/06/25 11:22:10 INFO mapreduce.Job:  map 100% reduce 100%
14/06/25 11:22:10 INFO mapreduce.Job: Job job_1403675322820_0001 failed with state FAILED due to: Task failed task_1403675322820_0001_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

14/06/25 11:22:10 INFO mapreduce.Job: Counters: 9
    Job Counters 
        Failed map tasks=4
        Launched map tasks=4
        Other local map tasks=3
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=69797
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=69797
        Total vcore-seconds taken by all map tasks=69797
        Total megabyte-seconds taken by all map tasks=71472128
It does create the compressed output files successfully:

hadoop fs -ls MySort/zip1

Found 3 items
-rw-r--r--   1 hduser supergroup          0 2014-06-25 10:43 MySort/zip1/_SUCCESS
-rw-r--r--   1 hduser supergroup   42488018 2014-06-25 10:43 MySort/zip1/part-00000.bz2
-rw-r--r--   1 hduser supergroup   42504084 2014-06-25 10:43 MySort/zip1/part-00001.bz2
Then I ran the following command:

hadoop jar mysort.jar org.myorg.MySort MySort/input/ MySort/zip1
It still does not work. Is there something I am missing here?

When I run it without the bz2 compressed file and pass the plain text file Data/Data.txt directly, i.e. upload it to MySort/input in HDFS with `hadoop fs -put Data/Data.txt MySort/input`, it works fine.
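
One observation worth checking: the failing token "0ustar" looks like part of a tar header ("ustar" is the magic string in POSIX tar headers), which would suggest the .bz2/.gz layer is decompressed fine but the bytes handed to the mapper are still a tar archive rather than plain text. A small standalone sketch to inspect what the mapper actually sees, using the same CompressionCodecFactory lookup that TextInputFormat performs (the class name PeekCompressedInput is hypothetical):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class PeekCompressedInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);                      // e.g. MySort/input/Data.txt.bz2
        FileSystem fs = path.getFileSystem(conf);

        // Pick the codec by file extension, exactly as TextInputFormat does.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        InputStream in = (codec == null)
                ? fs.open(path)                             // no codec matched: raw bytes
                : codec.createInputStream(fs.open(path));   // decompress transparently

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            for (int i = 0; i < 5; i++) {                   // print the first few records
                String line = reader.readLine();
                if (line == null) break;
                System.out.println(line);
            }
        }
    }
}

If tar header fields such as "0ustar" appear in this output, the archive needs to be unpacked first and the raw text compressed on its own (e.g. a plain .bz2 of the data file, not a tar.bz2).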


Any help is greatly appreciated.

I worked around this by using a ToolRunner:

package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.util.NativeCodeLoader;        
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.io.compress.GzipCodec;        
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ToolMapReduce extends Configured implements Tool 
{


    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> 
    {
        private final static IntWritable Marks = new IntWritable();
        private Text name = new Text();
        String one,two;
        int num;
        // Each input line is expected to look like "<name> <marks>".
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException 
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) 
            {
                one = tokenizer.nextToken();
                name.set(one);
                if (!tokenizer.hasMoreTokens())
                    break;                       // a name with no paired number: stop rather than reuse a stale value
                two = tokenizer.nextToken();
                num = Integer.parseInt(two);     // throws NumberFormatException on non-numeric input (e.g. tar header bytes)
                Marks.set(num);
                context.write(name, Marks);
            }
        }
    } 

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> 
    {

        public void reduce(Text key, Iterable<IntWritable> values, Context context) 
        throws IOException, InterruptedException 
        {
            int sum = 0;
            for (IntWritable val : values) 
            {
            sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception  
    {
        int res = ToolRunner.run(new Configuration(), new ToolMapReduce(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception
    {   

        Configuration conf = this.getConf();
        //Configuration conf = new Configuration();
        //conf.setOutputFormat(SequenceFileOutputFormat.class); 
        //SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK); 
        //SequenceFileOutputFormat.setCompressOutput(conf, true); 
        //conf.set("mapred.output.compress","true");
        //  conf.set("mapred.output.compression","org.apache.hadoop.io.compress.SnappyCodec");

        //conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
        //  conf.set("mapreduce.job.inputformat.class", "com.wizecommerce.utils.mapred.TextInputFormat");

        //  conf.set("mapreduce.job.outputformat.class", "com.wizecommerce.utils.mapred.TextOutputFormat");
        //  conf.setBoolean("mapreduce.map.output.compress",true);
        conf.setBoolean("mapred.output.compress",true);
        //conf.setBoolean("mapreduce.output.fileoutputformat.compress",false);
        //conf.setBoolean("mapreduce.map.output.compress",true);
        conf.set("mapred.output.compression.type", "BLOCK");     
        //conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
        //      conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
        conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
        Job job = new Job(conf, "mysort");
        job.setJarByClass(org.myorg.ToolMapReduce.class);
        //job.setJarByClass(org.myorg.MySort.class);
        job.setJobName("mysort");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        //  FileInputFormat.setCompressInput(job,true);
        FileOutputFormat.setCompressOutput(job, true);
        //FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        //  conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString()); 

        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
        //job.waitForCompletion(true);
    }


}
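
Since ToolMapReduce implements Tool and runs through ToolRunner, the compression settings could also be passed as -D generic options at launch time instead of being hard-coded, for example (a sketch using the non-deprecated Hadoop 2 property names):

hadoop jar mysort.jar org.myorg.ToolMapReduce -D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec MySort/input/ MySort/output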