以压缩文件作为输入运行hadoop。hadoop读取的数据输入不按顺序。数字格式异常
在修改mapred-site.xml中的属性后,我给出了一个tar.bz2文件、.gz和tar.gz文件作为输入。以上这些似乎都不起作用。这里我假设hadoop读取为输入的记录是无序的,即输入的一列是字符串,另一列是整数,但当从压缩文件中读取它时,由于一些无序数据,hadoop在某个点将字符串部分读取为整数,并生成非法格式异常。我只是个傻瓜。我想知道配置或代码中是否存在问题 core-site.xml中的属性为以压缩文件作为输入运行hadoop。hadoop读取的数据输入不按顺序。数字格式异常,hadoop,compression,Hadoop,Compression,在修改mapred-site.xml中的属性后,我给出了一个tar.bz2文件、.gz和tar.gz文件作为输入。以上这些似乎都不起作用。这里我假设hadoop读取为输入的记录是无序的,即输入的一列是字符串,另一列是整数,但当从压缩文件中读取它时,由于一些无序数据,hadoop在某个点将字符串部分读取为整数,并生成非法格式异常。我只是个傻瓜。我想知道配置或代码中是否存在问题 core-site.xml中的属性为 <property> <name>io.compress
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apac\
he.hadoop.io.compress.SnappyCodec</value>
<description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
命令:
hadoop jar mysort.jar org.myorg.MySort MySort/input/ MySort/output
以下是输出:
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/06/25 11:20:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/06/25 11:20:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/06/25 11:20:29 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/06/25 11:20:29 INFO input.FileInputFormat: Total input paths to process : 1
14/06/25 11:20:29 INFO mapreduce.JobSubmitter: number of splits:1
14/06/25 11:20:29 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
14/06/25 11:20:29 INFO Configuration.deprecation: mapred.map.output.compression.codec is deprecated. Instead, use mapreduce.map.output.compress.codec
14/06/25 11:20:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1403675322820_0001
14/06/25 11:20:30 INFO impl.YarnClientImpl: Submitted application application_1403675322820_0001
14/06/25 11:20:30 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1403675322820_0001/
14/06/25 11:20:30 INFO mapreduce.Job: Running job: job_1403675322820_0001
14/06/25 11:20:52 INFO mapreduce.Job: Job job_1403675322820_0001 running in uber mode : false
14/06/25 11:20:52 INFO mapreduce.Job: map 0% reduce 0%
14/06/25 11:21:10 INFO mapreduce.Job: Task Id : attempt_1403675322820_0001_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "0ustar"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.myorg.MySort$Map.map(MySort.java:36)
at org.myorg.MySort$Map.map(MySort.java:23)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
14/06/25 11:21:29 INFO mapreduce.Job: Task Id : attempt_1403675322820_0001_m_000000_1, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "0ustar"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.myorg.MySort$Map.map(MySort.java:36)
at org.myorg.MySort$Map.map(MySort.java:23)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
14/06/25 11:21:49 INFO mapreduce.Job: Task Id : attempt_1403675322820_0001_m_000000_2, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "0ustar"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at org.myorg.MySort$Map.map(MySort.java:36)
at org.myorg.MySort$Map.map(MySort.java:23)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
14/06/25 11:22:10 INFO mapreduce.Job: map 100% reduce 100%
14/06/25 11:22:10 INFO mapreduce.Job: Job job_1403675322820_0001 failed with state FAILED due to: Task failed task_1403675322820_0001_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/06/25 11:22:10 INFO mapreduce.Job: Counters: 9
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=69797
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=69797
Total vcore-seconds taken by all map tasks=69797
Total megabyte-seconds taken by all map tasks=71472128
它成功地创建了压缩文件
hadoop fs -ls MySort/zip1
Found 3 items
-rw-r--r-- 1 hduser supergroup 0 2014-06-25 10:43 MySort/zip1/_SUCCESS
-rw-r--r-- 1 hduser supergroup 42488018 2014-06-25 10:43 MySort/zip1/part-00000.bz2
-rw-r--r-- 1 hduser supergroup 42504084 2014-06-25 10:43 MySort/zip1/part-00001.bz2
然后运行以下命令:
hadoop jar mysort.jar org.myorg.MySort MySort/input/ MySort/zip1
它仍然不起作用。这里有什么我遗漏的吗
当我运行它而不使用压缩文件bz2并直接将文本文件Data/Data.txt传递给它时,它工作正常,即将其上传到hdfs中的MySort/input(hadoop fs-put Data/Data.txt MySort/input)
非常感谢您的帮助我为此做了一些变通。我用了一个工具跑步器
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.util.NativeCodeLoader;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ToolMapReduce extends Configured implements Tool
{
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable Marks = new IntWritable();
private Text name = new Text();
String one,two;
int num;
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
one=tokenizer.nextToken();
name.set(one);
if(tokenizer.hasMoreTokens())
two=tokenizer.nextToken();
num=Integer.parseInt(two);
Marks.set(num);
context.write(name, Marks);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception
{
int res = ToolRunner.run(new Configuration(), new ToolMapReduce(), args);
System.exit(res);
}
public int run(String[] args) throws Exception
{
Configuration conf = this.getConf();
//Configuration conf = new Configuration();
//conf.setOutputFormat(SequenceFileOutputFormat.class);
//SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
//SequenceFileOutputFormat.setCompressOutput(conf, true);
//conf.set("mapred.output.compress","true");
// conf.set("mapred.output.compression","org.apache.hadoop.io.compress.SnappyCodec");
//conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
// conf.set("mapreduce.job.inputformat.class", "com.wizecommerce.utils.mapred.TextInputFormat");
// conf.set("mapreduce.job.outputformat.class", "com.wizecommerce.utils.mapred.TextOutputFormat");
// conf.setBoolean("mapreduce.map.output.compress",true);
conf.setBoolean("mapred.output.compress",true);
//conf.setBoolean("mapreduce.output.fileoutputformat.compress",false);
//conf.setBoolean("mapreduce.map.output.compress",true);
conf.set("mapred.output.compression.type", "BLOCK");
//conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
// conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Job job = new Job(conf, "mysort");
job.setJarByClass(org.myorg.ToolMapReduce.class);
//job.setJarByClass(org.myorg.MySort.class);
job.setJobName("mysort");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// FileInputFormat.setCompressInput(job,true);
FileOutputFormat.setCompressOutput(job, true);
//FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString());
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
//job.waitForCompletion(true);
}
}
package org.myorg;
导入java.io.IOException;
导入java.util.*;
导入org.apache.hadoop.util.NativeCodeLoader;
导入org.apache.hadoop.fs.Path;
导入org.apache.hadoop.conf.*;
导入org.apache.hadoop.io.compress.CompressionCodec;
导入org.apache.hadoop.io.compress.CompressionInputStream;
导入org.apache.hadoop.io.compress.CompressionOutputStream;
导入org.apache.hadoop.io.compress.Decompressor;
导入org.apache.hadoop.io.*;
导入org.apache.hadoop.mapreduce.*;
导入org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
导入org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
导入org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
导入org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
导入org.apache.hadoop.io.compress.gzip代码;
导入org.apache.hadoop.io.compress.*;
导入org.apache.hadoop.io.compress.BZip2Codec;
导入org.apache.hadoop.util.Tool;
导入org.apache.hadoop.util.ToolRunner;
公共类ToolMapReduce扩展配置的实现工具
{
公共静态类映射扩展映射器
{
私有最终静态IntWritable标记=新的IntWritable();
私有文本名称=新文本();
一二线,;
int-num;
公共void映射(LongWritable键、文本值、上下文上下文)引发IOException、InterruptedException
{
字符串行=value.toString();
StringTokenizer标记器=新的StringTokenizer(行);
while(tokenizer.hasMoreTokens())
{
一=标记器.nextToken();
名称.套(一套);
if(tokenizer.hasMoreTokens())
two=标记器.nextToken();
num=Integer.parseInt(两个);
标记集(num);
上下文。写(姓名、标记);
}
}
}
公共静态类Reduce扩展Reducer
{
公共void reduce(文本键、Iterable值、上下文)
抛出IOException、InterruptedException
{
整数和=0;
for(可写入值:值)
{
sum+=val.get();
}
write(key,newintwriteable(sum));
}
}
公共静态void main(字符串[]args)引发异常
{
int res=ToolRunner.run(新配置(),新ToolMapReduce(),args);
系统退出(res);
}
公共int运行(字符串[]args)引发异常
{
Configuration=this.getConf();
//Configuration conf=新配置();
//conf.setOutputFormat(SequenceFileOutputFormat.class);
//setOutputCompressionType(conf,CompressionType.BLOCK);
//setCompressOutput(conf,true);
//conf.set(“mapred.output.compress”、“true”);
//conf.set(“mapred.output.compression”、“org.apache.hadoop.io.compress.SnappyCodec”);
//conf.set(“mapred.output.compression.codec”,“org.apache.hadoop.io.compress.SnappyCodec”);
//conf.set(“mapreduce.job.inputformat.class”、“com.wizecommerce.utils.mapred.TextInputFormat”);
//conf.set(“mapreduce.job.outputformat.class”、“com.wizecommerce.utils.mapred.TextOutputFormat”);
//conf.setBoolean(“mapreduce.map.output.compress”,true);
conf.setBoolean(“mapred.output.compress”,true);
//conf.setBoolean(“mapreduce.output.fileoutputformat.compress”,false);
//conf.setBoolean(“mapreduce.map.output.compress”,true);
conf.set(“mapred.output.compression.type”、“BLOCK”);
//conf.setClass(“mapreduce.map.output.compress.codec”,BZip2Codec.class,CompressionCodec.class);
//conf.setClass(“mapred.map.output.compression.codec”,gzicodec.class,CompressionCodec.class);
conf.setClass(“mapred.map.output.compression.codec”,gzicodec.class,CompressionCodec.class);
作业作业=新作业(conf,“mysort”);
setJarByClass(org.myorg.ToolMapReduce.class);
//job.setJarByClass(org.myorg.MySort.class);
job.setJobName(“mysort”);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
setInputFormatClass(TextInputFormat.class);
setOutputFormatClass(TextOutputFormat.class);
//setCompressInput(作业,true);
FileOutputFormat.setCompressOutput(作业,true);
//setOutputCompressorClass(作业,gzip代码.class);
//conf.set(“mapred.output.compression.type”,CompressionType.BLOCK.toString());
setOutputCompressorClass(作业,gzip代码.class);
addInputPath(作业,新路径(args[0]);
hadoop fs -ls MySort/zip1
Found 3 items
-rw-r--r-- 1 hduser supergroup 0 2014-06-25 10:43 MySort/zip1/_SUCCESS
-rw-r--r-- 1 hduser supergroup 42488018 2014-06-25 10:43 MySort/zip1/part-00000.bz2
-rw-r--r-- 1 hduser supergroup 42504084 2014-06-25 10:43 MySort/zip1/part-00001.bz2
hadoop jar mysort.jar org.myorg.MySort MySort/input/ MySort/zip1
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.util.NativeCodeLoader;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ToolMapReduce extends Configured implements Tool
{
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable Marks = new IntWritable();
private Text name = new Text();
String one,two;
int num;
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
one=tokenizer.nextToken();
name.set(one);
if(tokenizer.hasMoreTokens())
two=tokenizer.nextToken();
num=Integer.parseInt(two);
Marks.set(num);
context.write(name, Marks);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception
{
int res = ToolRunner.run(new Configuration(), new ToolMapReduce(), args);
System.exit(res);
}
public int run(String[] args) throws Exception
{
Configuration conf = this.getConf();
//Configuration conf = new Configuration();
//conf.setOutputFormat(SequenceFileOutputFormat.class);
//SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
//SequenceFileOutputFormat.setCompressOutput(conf, true);
//conf.set("mapred.output.compress","true");
// conf.set("mapred.output.compression","org.apache.hadoop.io.compress.SnappyCodec");
//conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
// conf.set("mapreduce.job.inputformat.class", "com.wizecommerce.utils.mapred.TextInputFormat");
// conf.set("mapreduce.job.outputformat.class", "com.wizecommerce.utils.mapred.TextOutputFormat");
// conf.setBoolean("mapreduce.map.output.compress",true);
conf.setBoolean("mapred.output.compress",true);
//conf.setBoolean("mapreduce.output.fileoutputformat.compress",false);
//conf.setBoolean("mapreduce.map.output.compress",true);
conf.set("mapred.output.compression.type", "BLOCK");
//conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
// conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Job job = new Job(conf, "mysort");
job.setJarByClass(org.myorg.ToolMapReduce.class);
//job.setJarByClass(org.myorg.MySort.class);
job.setJobName("mysort");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// FileInputFormat.setCompressInput(job,true);
FileOutputFormat.setCompressOutput(job, true);
//FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString());
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
//job.waitForCompletion(true);
}
}