Apache Spark: reading S3 files in a for loop within a Spark application
Is it unwise to read files in a for loop in a Spark program? Like this:
for (each file in S3 bucket)
    RDD <- file
    transform
    action
I think what you really want is the `sc.wholeTextFiles` API:
Regards,
Olivier.

Do you have to load the files in a loop, or are you trying to load all of the bucket's files into a single RDD?

My goal is to process all the files in the bucket, one file at a time, but all within the same application.
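The loop from the question can be sketched in Scala as follows. It does run, but each `sc.textFile` call builds a separate RDD and each action launches its own job; the bucket paths and the length-sum transformation are hypothetical placeholders, not from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PerFileLoop {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-file-loop"))

    // Hypothetical list of S3 objects; in practice you would list the
    // bucket with the AWS SDK or a Hadoop FileSystem API first.
    val files = Seq(
      "s3://my-bucket/input/file-0.txt",
      "s3://my-bucket/input/file-1.txt"
    )

    // One RDD, one transform, one action per file --
    // all inside the same Spark application.
    for (file <- files) {
      val rdd    = sc.textFile(file)      // RDD <- file
      val counts = rdd.map(_.length)      // transform
      println(s"$file: ${counts.sum()}")  // action
    }

    sc.stop()
  }
}
```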
/**
* Read a directory of text files from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI. Each file is read as a single record and returned in a
* key-value pair, where the key is the path of each file, the value is the content of each file.
*
* <p> For example, if you have the following files:
* {{{
* hdfs://a-hdfs-path/part-00000
* hdfs://a-hdfs-path/part-00001
* ...
* hdfs://a-hdfs-path/part-nnnnn
* }}}
*
 * Do `val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")`,
*
* <p> then `rdd` contains
* {{{
* (a-hdfs-path/part-00000, its content)
* (a-hdfs-path/part-00001, its content)
* ...
* (a-hdfs-path/part-nnnnn, its content)
* }}}
*
* @note Small files are preferred, large file is also allowable, but may cause bad performance.
*
* @param minPartitions A suggestion value of the minimal splitting number for input data.
*/
def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
sc.wholeTextFiles("s3://my-directory/2015*/user-134/*")
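Because `wholeTextFiles` returns an RDD of `(path, content)` pairs, you can still apply per-file logic while everything runs as one distributed job. A minimal sketch, assuming a hypothetical bucket layout and an illustrative line-count per file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("whole-text-files"))

    // Each element is (file path, entire file content), so logic that
    // needs to see a whole file at once stays together per record.
    val files = sc.wholeTextFiles("s3://my-bucket/2015*/user-134/*")

    // Illustrative per-file transformation: count lines in each file.
    val linesPerFile = files.mapValues(content => content.split("\n").length)

    linesPerFile.collect().foreach { case (path, lines) =>
      println(s"$path: $lines lines")
    }

    sc.stop()
  }
}
```

Note the scaladoc's caveat above: this API reads each file as a single record, so it is best suited to many small files rather than a few very large ones.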