Apache Spark: reading S3 files in a for loop within a Spark application
Is it unwise to read files in a for loop in a Spark program? Like this:
for (each file in S3 bucket)
    RDD <- file
    transform
    action
I think what you really want is the `sc.wholeTextFiles` API:
Regards,
Olivier.

Do you have to load the files in a loop, or are you trying to load all of the bucket's files into a single RDD?

My goal is to process all the files in the bucket, one file at a time, but all within the same application.
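The loop from the question can be sketched in Scala as follows. It does run, but each `sc.textFile` call builds a separate RDD and each action launches its own job; the bucket paths and the length-sum transformation are hypothetical placeholders, not from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PerFileLoop {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-file-loop"))

    // Hypothetical list of S3 objects; in practice you would list the
    // bucket with the AWS SDK or a Hadoop FileSystem API first.
    val files = Seq(
      "s3://my-bucket/input/file-0.txt",
      "s3://my-bucket/input/file-1.txt"
    )

    // One RDD, one transform, one action per file --
    // all inside the same Spark application.
    for (file <- files) {
      val rdd    = sc.textFile(file)      // RDD <- file
      val counts = rdd.map(_.length)      // transform
      println(s"$file: ${counts.sum()}")  // action
    }

    sc.stop()
  }
}
```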
/**
* Read a directory of text files from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI. Each file is read as a single record and returned in a
* key-value pair, where the key is the path of each file, the value is the content of each file.
*
* <p> For example, if you have the following files:
* {{{
* hdfs://a-hdfs-path/part-00000
* hdfs://a-hdfs-path/part-00001
* ...
* hdfs://a-hdfs-path/part-nnnnn
* }}}
*
 * Do `val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")`,
*
* <p> then `rdd` contains
* {{{
* (a-hdfs-path/part-00000, its content)
* (a-hdfs-path/part-00001, its content)
* ...
* (a-hdfs-path/part-nnnnn, its content)
* }}}
*
* @note Small files are preferred, large file is also allowable, but may cause bad performance.
*
* @param minPartitions A suggestion value of the minimal splitting number for input data.
*/
def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
sc.wholeTextFiles("s3://my-directory/2015*/user-134/*")
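Because `wholeTextFiles` returns an RDD of `(path, content)` pairs, you can still apply per-file logic while everything runs as one distributed job. A minimal sketch, assuming a hypothetical bucket layout and an illustrative line-count per file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("whole-text-files"))

    // Each element is (file path, entire file content), so logic that
    // needs to see a whole file at once stays together per record.
    val files = sc.wholeTextFiles("s3://my-bucket/2015*/user-134/*")

    // Illustrative per-file transformation: count lines in each file.
    val linesPerFile = files.mapValues(content => content.split("\n").length)

    linesPerFile.collect().foreach { case (path, lines) =>
      println(s"$path: $lines lines")
    }

    sc.stop()
  }
}
```

Note the scaladoc's caveat above: this API reads each file as a single record, so it is best suited to many small files rather than a few very large ones.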