Apache Spark: reading S3 files in a for loop inside a Spark application


Is it a bad idea to read files inside a for loop in a Spark program? Something like this:

for (each file in S3 bucket)
  RDD <- file
  transform
  action
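For concreteness, the loop pattern being asked about might look roughly like this in Scala (a sketch only: the s3a:// scheme, the hard-coded file list, and the filter/count steps are placeholder assumptions; a real job would list the bucket first and apply its own transformations):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("per-file-loop")
val sc   = new SparkContext(conf)

// Hypothetical list of S3 objects; in practice this would come from listing the bucket.
val files = Seq(
  "s3a://my-bucket/user-134/part-00000.txt",
  "s3a://my-bucket/user-134/part-00001.txt"
)

for (path <- files) {
  val rdd     = sc.textFile(path)                 // one RDD per file
  val cleaned = rdd.filter(_.nonEmpty)            // placeholder transformation
  println(s"$path: ${cleaned.count()} lines")     // one action per file
}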
I think what you are really looking for is the sc.wholeTextFiles API:

Regards,


Olivier。

Comment: Do you have to load the files in a loop, or are you trying to load all of the files in the bucket into a single RDD?

Reply: My goal is to process all of the files in the bucket, one file at a time, but all within the same application.
  /**
   * Read a directory of text files from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI. Each file is read as a single record and returned in a
   * key-value pair, where the key is the path of each file, the value is the content of each file.
   *
   * <p> For example, if you have the following files:
   * {{{
   *   hdfs://a-hdfs-path/part-00000
   *   hdfs://a-hdfs-path/part-00001
   *   ...
   *   hdfs://a-hdfs-path/part-nnnnn
   * }}}
   *
   * Do `val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")`,
   *
   * <p> then `rdd` contains
   * {{{
   *   (a-hdfs-path/part-00000, its content)
   *   (a-hdfs-path/part-00001, its content)
   *   ...
   *   (a-hdfs-path/part-nnnnn, its content)
   * }}}
   *
   * @note Small files are preferred; large files are also allowed, but may cause poor performance.
   *
   * @param minPartitions A suggestion value of the minimal splitting number for input data.
   */
  def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

sc.wholeTextFiles("s3://my-directory/2015*/user-134/*")
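Putting it together, a minimal sketch of using wholeTextFiles to handle every file inside one application (the path pattern is the one above; the per-file line count is only a placeholder for real processing). Each record is a (file path, file content) pair, so the per-file logic runs inside a single job instead of a driver-side loop:

val files = sc.wholeTextFiles("s3://my-directory/2015*/user-134/*")

// Each record is (path, full content of that file).
val lineCounts = files.map { case (path, content) =>
  (path, content.split("\n").length)   // placeholder per-file processing
}

lineCounts.collect().foreach { case (path, n) =>
  println(s"$path: $n lines")
}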