How to enable Snappy/the Snappy codec over a Hadoop cluster for Google Compute Engine
I am trying to run a Hadoop job on Google Compute Engine against our compressed data, which is located on Google Cloud Storage. While trying to read the data through SequenceFileInputFormat, I get the following exception:
hadoop@hadoop-m:/home/salikeeno$ hadoop jar ${JAR} ${PROJECT} ${OUTPUT_TABLE}
14/08/21 19:56:00 INFO jaws.JawsApp: Using export bucket 'askbuckerthroughhadoop' as specified in 'mapred.bq.gcs.bucket'
14/08/21 19:56:00 INFO bigquery.BigQueryConfiguration: Using specified project-id 'regal-campaign-641' for output
14/08/21 19:56:00 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
14/08/21 19:56:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/21 19:56:03 INFO input.FileInputFormat: Total input paths to process : 1
14/08/21 19:56:09 INFO mapred.JobClient: Running job: job_201408211943_0002
14/08/21 19:56:10 INFO mapred.JobClient: map 0% reduce 0%
14/08/21 19:56:20 INFO mapred.JobClient: Task Id : attempt_201408211943_0002_m_000001_0, Status : FAILED
java.lang.RuntimeException: native snappy library not available
at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:189)
at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1581)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1490)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:521)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
For reference, the cluster was deployed with:

./bdutil -e extensions/spark/spark_env.sh deploy
Regarding the first and second questions: there are two hurdles when dealing with Snappy in Hadoop on GCE. First, the native support libraries built by Apache and bundled with the Hadoop 2 tarballs are built for i386, while GCE instances are amd64. Hadoop 1 bundles binaries for both platforms, but snappy still cannot be located without either bundling it or modifying the environment. Because of this architecture difference, no native compressors (snappy or otherwise) are usable under Hadoop 2, and snappy is not easily available under Hadoop 1. The second hurdle is that libsnappy itself is not installed by default.
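If you want to verify the mismatch yourself, inspecting the bundled native libraries is a quick check. A minimal sketch, assuming an Apache Hadoop 2 tarball unpacked under hadoop-2.4.1/ (the exact path is illustrative):

# The stock Apache Hadoop 2 native libs report 32-bit / Intel 80386 here,
# while a GCE instance needs amd64 binaries.
file hadoop-2.4.1/lib/native/libhadoop.so*
# A GCE instance prints "x86_64" here.
uname -m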
The easiest way to clear both hurdles is to build your own Hadoop tarball that contains the amd64 native Hadoop libraries as well as libsnappy. The steps below should help you do that and stage the resulting tarball for bdutil to use.
First, launch a new GCE VM using a Debian Wheezy backports image and grant the VM's service account read/write access to Cloud Storage. We'll use it as our build machine, and it can safely be discarded once the binaries are built and stored.
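For reference, a sketch of creating such a build VM with the gcloud CLI; the instance name, zone, and image below are placeholders, and storage-rw is the Cloud Storage read/write scope mentioned above:

# Hypothetical example: a throwaway build VM with Cloud Storage read/write
# access (adjust the name, zone, and image to your environment).
gcloud compute instances create hadoop-build \
  --zone us-central1-a \
  --image debian-7-backports \
  --scopes storage-rw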
Build Hadoop 1.2.1 with Snappy
SSH to your new instance and run the following commands, checking for any errors along the way:
sudo apt-get update
sudo apt-get install pkg-config libsnappy-dev libz-dev libssl-dev gcc make cmake automake autoconf libtool g++ openjdk-7-jdk maven ant
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar zxvf hadoop-1.2.1.tar.gz
pushd hadoop-1.2.1/
# Bundle libsnappy so we don't have to apt-get install it on each machine
cp /usr/lib/libsnappy* lib/native/Linux-amd64-64/
# Test to make certain Snappy is being loaded and is working:
bin/hadoop jar ./hadoop-test-1.2.1.jar testsequencefile -seed 0 -count 1000 -compressType RECORD xxx -codec org.apache.hadoop.io.compress.SnappyCodec -check
# Create a new tarball of Hadoop 1.2.1:
popd
rm hadoop-1.2.1.tar.gz
tar zcvf hadoop-1.2.1.tar.gz hadoop-1.2.1/
# Store the tarball on GCS:
gsutil cp hadoop-1.2.1.tar.gz gs://<some bucket>/hadoop-1.2.1.tar.gz
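Before moving on, it can be worth a quick sanity check that libsnappy actually made it into the new archive; a small sketch using the layout from the steps above:

# The native directory inside the tarball should now list libsnappy*.
tar tzf hadoop-1.2.1.tar.gz | grep 'lib/native/Linux-amd64-64/libsnappy'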
Build Hadoop 2.4.1 with Snappy

Back on the build machine, run the following commands, again checking for errors along the way:

sudo apt-get update
sudo apt-get install pkg-config libsnappy-dev libz-dev libssl-dev gcc make cmake automake autoconf libtool g++ openjdk-7-jdk maven ant
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
# Protobuf 2.5.0 is required and not in Debian-backports
wget http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar xvf protobuf-2.5.0.tar.gz
pushd protobuf-2.5.0/ && ./configure && make && sudo make install && popd
sudo ldconfig
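# Optional sanity check: the Hadoop build below requires exactly protobuf
# 2.5.0; this should print "libprotoc 2.5.0".
protoc --version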
wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
# Unpack source
tar zxvf hadoop-2.4.1-src.tar.gz
pushd hadoop-2.4.1-src
# Build Hadoop
mvn package -Pdist,native -DskipTests -Dtar
pushd hadoop-dist/target/
pushd hadoop-2.4.1/
# Bundle libsnappy so we don't have to apt-get install it on each machine
cp /usr/lib/libsnappy* lib/native/
# Test that everything is working:
bin/hadoop jar share/hadoop/common/hadoop-common-2.4.1-tests.jar org.apache.hadoop.io.TestSequenceFile -seed 0 -count 1000 -compressType RECORD xxx -codec org.apache.hadoop.io.compress.SnappyCodec -check
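# Optional extra check: in Hadoop 2.4, "hadoop checknative -a" lists the
# native codecs; snappy should be reported as true with the bundled lib.
bin/hadoop checknative -a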
popd
# Create a new tarball with libsnappy:
rm hadoop-2.4.1.tar.gz
tar zcf hadoop-2.4.1.tar.gz hadoop-2.4.1/
# Store the new tarball on GCS:
gsutil cp hadoop-2.4.1.tar.gz gs://<some bucket>/hadoop-2.4.1.tar.gz
popd
popd
Once the tarball is on GCS, update bdutil: open bdutil_env.sh (for Hadoop 1.2.1) or hadoop2_env.sh (for Hadoop 2.4.1) and change HADOOP_TARBALL_URI to point at the URI where we stored the tarball above, for example:
HADOOP_TARBALL_URI='gs://<some bucket>/hadoop-1.2.1.tar.gz'
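If you prefer to script that edit, a hypothetical one-liner along these lines does the same thing (the bucket name is a placeholder; use hadoop2_env.sh for the 2.4.1 tarball):

# In-place edit of bdutil_env.sh, then redeploy so the cluster picks up
# the custom tarball.
sed -i "s|^HADOOP_TARBALL_URI=.*|HADOOP_TARBALL_URI='gs://<some bucket>/hadoop-1.2.1.tar.gz'|" bdutil_env.sh
./bdutil deploy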
Comments:
- There's no need to compile the native libs for Hadoop 1.2.1, since they are actually bundled with the distribution. I've removed that step from the 1.2.1 section and updated the paragraph leading up to it. Bundling libsnappy and preparing your own tarball is still (by far) the easiest route.
- Thanks a lot, Angus, for the detailed reply and the spot-on commands; following your steps I successfully deployed Hadoop with the SnappyCodec.
- @AngusDavis Can I use Snappy on my Windows machine?
- @GopsAB It looks like Snappy is supported on Windows, but I haven't tried it myself; see the linked page for details on how to test/build it.