Unzip (decompress / extract) util using spark scala
I have customer_input_data.tar.gz in HDFS, which contains the data of 10 different tables in csv file format. I need to extract this file to /my/output/path using spark scala.
Please suggest how to extract the customer_input_data.tar.gz file using spark scala.

tar.gz is not a splittable format in Hadoop. So the file will not actually be distributed across the cluster, and you get no benefit from distributed computation/processing in Hadoop or Spark. A better approach may be to:
- decompress the file on the operating system, then put the extracted files back into Hadoop individually
new GZIPInputStream(new FileInputStream("your file path"))
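To make that streaming idea concrete, here is a minimal, self-contained sketch of a GZIP round trip using only the JDK (in memory, no HDFS; the sample csv bytes are made up for the demo):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipRoundTrip {
  // Compress a byte array with GZIP.
  def gzip(raw: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(raw)
    gz.close() // close() flushes the GZIP trailer
    bos.toByteArray
  }

  // Stream-decompress a GZIP byte array back into the original bytes.
  def gunzip(compressed: Array[Byte]): Array[Byte] = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    in.close()
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val original = "id,name\n1,alice\n2,bob\n".getBytes("UTF-8")
    val restored = gunzip(gzip(original))
    println(new String(restored, "UTF-8") == new String(original, "UTF-8"))
  }
}
```

In the real case the `FileInputStream` would be replaced by the stream you get from HDFS (`fs.open(path)`), and a tar reader would be layered on top to split the archive into its member files.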
I developed the following code for decompressing the files using scala. You need to pass the input path, the output path, and the Hadoop FileSystem.
/* Below method is used for extracting a .tar.gz archive from HDFS into HDFS. */
import java.io.{BufferedOutputStream, IOException}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.io.FilenameUtils
import org.apache.hadoop.fs.{FileSystem, Path}

private val BUFFER_SIZE = 8192

@throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
  val gzipIn = new GzipCompressorInputStream(fs.open(new Path(fullpath)))
  val tarIn = new TarArchiveInputStream(gzipIn)
  try {
    // Use the archive's base name (e.g. "customer_input_data") as the output folder.
    val fileName1 = FilenameUtils.getName(fullpath)
    val tarNamesFolder = fileName1.substring(0, fileName1.indexOf('.'))
    println("Folder name: " + tarNamesFolder)
    // In Scala an assignment evaluates to Unit, so the Java-style
    // `while ((entry = tarIn.getNextEntry()) != null)` does not compile;
    // iterate with Iterator.continually/takeWhile instead.
    Iterator.continually(tarIn.getNextTarEntry).takeWhile(_ != null).foreach { entry =>
      // Entry names are the csv files packed inside the compressed tar file.
      println("Entry name: " + entry.getName)
      if (!entry.isDirectory) {
        val targetPath = houtPath + "/" + tarNamesFolder + "/" + entry.getName
        // fs.create(..., true) overwrites and creates missing parent directories
        // on HDFS, so directory entries need no separate handling here.
        val dest = new BufferedOutputStream(fs.create(new Path(targetPath), true), BUFFER_SIZE)
        try {
          val data = new Array[Byte](BUFFER_SIZE)
          var count = tarIn.read(data, 0, BUFFER_SIZE)
          while (count != -1) {
            dest.write(data, 0, count)
            count = tarIn.read(data, 0, BUFFER_SIZE)
          }
        } finally dest.close()
      }
    }
    println("Untar completed successfully!")
  } finally tarIn.close() // also closes the wrapped gzip and HDFS streams
}
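One Scala-specific pitfall in code like the above: assignment expressions evaluate to `Unit`, so the familiar Java idiom `while ((entry = in.getNextEntry()) != null)` does not compile. A standard replacement is `Iterator.continually(...).takeWhile(_ != null)`. A self-contained sketch, where the hypothetical `next()` stub over a list stands in for `tarIn.getNextTarEntry`:

```scala
object NullTerminatedRead {
  // Stub source that yields values and then null when exhausted,
  // mimicking TarArchiveInputStream.getNextTarEntry.
  private var remaining = List("a.csv", "b.csv", "c.csv")
  def next(): String = remaining match {
    case head :: tail => remaining = tail; head
    case Nil          => null
  }

  def main(args: Array[String]): Unit = {
    // Pull from next() until it returns null, evaluating it exactly once per step.
    val names = Iterator.continually(next()).takeWhile(_ != null).toList
    println(names.mkString(",")) // a.csv,b.csv,c.csv
  }
}
```

`Iterator.continually` re-evaluates its argument lazily on each pull, which is exactly the "read until the stream returns null" pattern the tar loop needs.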
Try following this link, it may give you some answers: Thank you for sending the useful link. I am working on my solution and will update.