Hadoop: copying a large file into HDFS
I'm trying to copy a large file (32 GB) into HDFS. I've never had any problems copying files into HDFS before, but those files were all smaller. I'm using hadoop fs -put, and everything goes fine up to 13.7 GB, but then I hit an exception:

hadoop fs -put * /data/unprocessed/
Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: Input/output error
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:150)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:384)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:217)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:230)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:191)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1183)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:130)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1762)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
Caused by: java.io.IOException: Input/output error
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:242)
at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.read(RawLocalFileSystem.java:91)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:144)
... 20 more
When I check the log files (on my NameNode and DataNodes), I see that the lease on the file was removed, but no reason is given. According to the log files everything went fine. Here are the last lines of my NameNode log:
2013-01-28 09:43:34,176 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/unprocessed/AMR_EXPORT.csv. blk_-4784588526865920213_1001
2013-01-28 09:44:16,459 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.6.114:50010 is added to blk_-4784588526865920213_1001 size 30466048
2013-01-28 09:44:16,466 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file /data/unprocessed/AMR_EXPORT.csv from client DFSClient_1738322483
2013-01-28 09:44:16,472 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /data/unprocessed/AMR_EXPORT.csv is closed by DFSClient_1738322483
2013-01-28 09:44:16,517 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 168 Total time for transactions(ms): 26Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
Does anyone have a clue about this? I've checked core-default.xml and hdfs-default.xml for a property I could override to extend the lease, but couldn't find one.

Some suggestions:
- If you are copying multiple files, use multiple put sessions
- If it's just one big file, compress it before copying, or split the big file into smaller files and copy those
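The split-and-recombine route can be sketched like this; the file name and chunk size here are placeholders (a real 32 GB file would use something like -b 4096m), and the hadoop commands are shown only as comments since they need a running cluster:

```shell
# Stand-in for the large local file (64 KiB here; the real case is 32 GB):
dd if=/dev/urandom of=AMR_EXPORT.csv bs=1024 count=64 2>/dev/null

# Split into fixed-size chunks; suffixes sort in order (part_aa, part_ab, ...):
split -b 16k AMR_EXPORT.csv AMR_EXPORT.csv.part_

# Each chunk could then be pushed separately, e.g.:
#   hadoop fs -put AMR_EXPORT.csv.part_* /data/unprocessed/
# and the original recovered on read by concatenating in suffix order, e.g.:
#   hadoop fs -cat '/data/unprocessed/AMR_EXPORT.csv.part_*' > AMR_EXPORT.csv

# Locally, concatenation reproduces the original byte for byte:
cat AMR_EXPORT.csv.part_* > reassembled.csv
cmp -s AMR_EXPORT.csv reassembled.csv && echo "round-trip OK"
```

Pushing smaller chunks limits how much work a single failed put throws away, at the cost of reassembling on the other side.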
I used split -b 4096m to split it, and that worked fine. The only problem with the latter approach is that I'm currently testing Hadoop on 3 old laptops with 60 GB hard disks, so there is no room to split a 32 GB file. On the other hand, Hadoop should be able to handle files of this size, shouldn't it? For that matter, I've copied a 1 TB file with -copyFromLocal and it gave me no problems at all, but that was on a 300+ node cluster :-) In a desperate attempt I just tried again, and for some strange reason I succeeded in copying the file. I still don't know what caused the error, but I had tried nearly 10 times with -copyFromLocal and -put. The file is now in HDFS, but I'm fairly sure that if I tried again it would fail.
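For completeness, the compression route from the suggestions above can be sketched as follows; the file name is a placeholder and the hadoop command is commented out since it needs a running cluster:

```shell
# Small stand-in for the CSV export:
printf 'id,value\n1,foo\n2,bar\n' > AMR_EXPORT.csv

# Compress locally before the transfer; text exports usually shrink a lot:
gzip -c AMR_EXPORT.csv > AMR_EXPORT.csv.gz

# The single compressed file would then be pushed instead of the raw one:
#   hadoop fs -put AMR_EXPORT.csv.gz /data/unprocessed/

# gzip -t checks archive integrity before relying on the copy:
gzip -t AMR_EXPORT.csv.gz && echo "archive OK"
```

Note that compressing still needs local scratch space for the .gz file, so on the 60 GB laptops described above this helps only as much as the data actually compresses.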