Scala: FileUtil.copyMerge() in AWS S3

scala hadoop apache-spark amazon-s3

I loaded a DataFrame into HDFS as text using the code below. finalDataFrame is the DataFrame.

finalDataFrame.repartition(1).rdd.saveAsTextFile(targetFile)

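After running the code above, I found that a directory was created with the file name I supplied, and a file was created under that directory, but not in text format. The file name looks like part-00000.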

I solved this for HDFS with the code below.

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)

Now I get a text file with the given file name at the path mentioned above.

But when I try the same operation on S3, it throws an exception.

FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)

java.lang.IllegalArgumentException: Wrong FS: s3a://globalhadoop/data, expected: hdfs://*********.aws.*****.com:8050

It seems S3 paths are not supported here. Can anyone help me do this?

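You are passing the hdfs filesystem as the destination FS to FileUtil.copyMerge. You need the real FS of the destination, which you can get by calling Path.getFileSystem(Configuration) on the destination path you have created.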
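To illustrate why the exception appears: a FileSystem instance is bound to one URI, and it rejects paths whose scheme or authority point somewhere else. The sketch below is a simplified model of that check written with java.net.URI only. It is not Hadoop's actual implementation, and the names WrongFsDemo, belongsTo and namenode.example.com are hypothetical.

```scala
import java.net.URI

object WrongFsDemo {
  // Simplified stand-in for the path check a Hadoop FileSystem performs.
  // Hypothetical helper: compares only scheme and authority, ignoring case.
  def belongsTo(fsUri: String, path: String): Boolean = {
    val fs = new URI(fsUri)
    val p = new URI(path)
    // A path without a scheme is treated as belonging to the filesystem itself.
    p.getScheme == null ||
      (p.getScheme.equalsIgnoreCase(fs.getScheme) &&
        Option(p.getAuthority).forall(_.equalsIgnoreCase(fs.getAuthority)))
  }
}
```

With an hdfs:// filesystem, WrongFsDemo.belongsTo("hdfs://namenode.example.com:8050", "s3a://globalhadoop/data") is false, which corresponds to the "Wrong FS" case in the question.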

I solved the problem with the following code.

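import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// sc is assumed to be the active SparkContext.
def createOutputTextFile(srcPath: String, dstPath: String, s3BucketPath: String): Unit = {
  var fileSystem: FileSystem = null
  var conf: Configuration = null
  if (srcPath.toLowerCase().contains("s3a") || srcPath.toLowerCase().contains("s3n")) {
    // S3: use Spark's Hadoop configuration and bind the FileSystem to the bucket URI.
    conf = sc.hadoopConfiguration
    fileSystem = FileSystem.get(new URI(s3BucketPath), conf)
  } else {
    // Otherwise: the default FileSystem from a fresh Configuration.
    conf = new Configuration()
    fileSystem = FileSystem.get(conf)
  }
  // Merge the files under srcPath into dstPath and delete the source directory.
  FileUtil.copyMerge(fileSystem, new Path(srcPath), fileSystem, new Path(dstPath), true, conf, null)
}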

I have written the code to handle both the S3 and HDFS filesystems, and both work fine.
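The branching on the path string can probably be avoided by letting each Path resolve its own filesystem with Path.getFileSystem(Configuration). The following is a minimal sketch of that idea, not the poster's code: MergeParts and mergeToSingleFile are hypothetical names, and it assumes a Hadoop 2.x classpath, since FileUtil.copyMerge was removed in Hadoop 3.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object MergeParts {
  // Merge every file under srcDir into the single file dstFile.
  // Each path resolves its own FileSystem from its scheme, so srcDir and
  // dstFile may live on different stores (for example hdfs:// and s3a://).
  def mergeToSingleFile(srcDir: String, dstFile: String, conf: Configuration): Boolean = {
    val src = new Path(srcDir)
    val dst = new Path(dstFile)
    val srcFs: FileSystem = src.getFileSystem(conf)
    val dstFs: FileSystem = dst.getFileSystem(conf)
    // deleteSource = true removes srcDir once the merge succeeds;
    // the final null means no separator string is added between files.
    FileUtil.copyMerge(srcFs, src, dstFs, dst, true, conf, null)
  }
}
```

In a Spark job this could be called as MergeParts.mergeToSingleFile(srcPath, dstPath, sc.hadoopConfiguration), assuming sc is the active SparkContext and its configuration can reach both filesystems.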

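"A file was created, but not in text format. The file name looks like part-00000": it is in text format. Just check its contents.

I want a file in .txt format without any directory. I have done this in HDFS and need the same for S3.

That is too complex. Create a path with new Path(s3BucketPath), then call path.getFileSystem(sc.hadoopConfiguration), regardless of what the destination path is. You will get whichever filesystem you need.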