Spark lists all leaf nodes even in partitioned data

Tags: apache-spark, amazon-s3, apache-spark-sql, partitioning, parquet


I have Parquet data partitioned by date and hour, with this folder structure:

events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-01-1
      -- part10000.parquet.gz
  -- event_date=2015-01-02
    -- event_hour=5
      -- part10000.parquet.gz
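Spark's partition discovery infers partition columns from `key=value` directory names like the ones above. As a rough illustration of that mapping, here is a minimal sketch in plain Java. `PartitionPathParser` is a hypothetical helper written for this example, not a Spark class, and it ignores details such as value escaping and type inference.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Spark: parses Hive-style key=value path segments. */
final class PartitionPathParser {

    /** Returns partition column -> value for every "key=value" segment, in path order. */
    static Map<String, String> parse(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            // Segments without a key and a value (table root, data file) are not partition columns.
            if (eq > 0 && eq < segment.length() - 1) {
                values.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // prints {event_date=2015-01-02, event_hour=5}
        System.out.println(parse("events_v3/event_date=2015-01-02/event_hour=5/part10000.parquet.gz"));
    }
}
```

The point of the sketch is that the partition values live entirely in the path, so in principle a filter on event_date can be resolved from directory names alone, without opening any file.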

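I created a table raw_events through Spark, but when I try to run a query, it scans the footers in every directory. This slows down the initial query, even though I am only querying one day's worth of data.

Query:

select * from raw_events where event_date='2016-01-01'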

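Similar question: (but it is old)

Log:

It spawns 350 tasks because there are 350 days' worth of data.

I have disabled schema merging (the mergeSchema option) and have also specified the schema to read, so Spark could go straight to the partition I am querying. Why does it list all the leaf files? Listing the leaf files with 2 executors takes 10 minutes, while the actual query execution takes 20 seconds.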

Code sample:

val sparkSession = org.apache.spark.sql.SparkSession.builder.getOrCreate()
// Load the table root with schema merging disabled.
val df = sparkSession.read.option("mergeSchema", "false").format("parquet").load("s3a://bucket/events_v3")
df.createOrReplaceTempView("temp_events")
// Query a single day of data.
sparkSession.sql(
  """
    |select verb, count(*) from temp_events where event_date = "2016-01-01" group by verb
  """.stripMargin).show()

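As soon as Spark is given a directory to read from, it issues a call to listLeafFiles (org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala). This in turn calls fs.listStatus, which makes an API call to get the list of files and directories. The method is then called again for each directory, recursively, until no directories are left. This design works well on HDFS, but it works badly on S3, because every file listing is a remote call. S3, on the other hand, supports fetching all files under a prefix, which is exactly what we need.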
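To make the difference between the two strategies concrete, the following is a toy simulation over an in-memory set of keys. It is only a sketch: `ListingSimulation` and its methods are hypothetical names, not Hadoop or AWS APIs, the key layout is invented for illustration, and the model assumes each list call returns at most 1000 entries and ignores any extra metadata requests a real file system client might make.

```java
import java.util.TreeSet;

/** Toy in-memory model of an object store; none of these names are Hadoop or AWS APIs. */
final class ListingSimulation {
    static final int PAGE_SIZE = 1000; // assumed number of entries returned per list call

    /** Calls needed to page through n entries; an empty listing still costs one call. */
    static int pages(int n) {
        return Math.max(1, (n + PAGE_SIZE - 1) / PAGE_SIZE);
    }

    /** Recursive strategy: list one directory level, then descend into each child directory. */
    static int recursiveCalls(TreeSet<String> keys, String prefix) {
        TreeSet<String> childDirs = new TreeSet<>();
        int entries = 0;
        for (String key : keys.subSet(prefix, true, prefix + Character.MAX_VALUE, false)) {
            String rest = key.substring(prefix.length());
            int slash = rest.indexOf('/');
            if (slash < 0) {
                entries++; // a file directly under this directory
            } else if (childDirs.add(prefix + rest.substring(0, slash + 1))) {
                entries++; // a sub-directory, reported once
            }
        }
        int calls = pages(entries);
        for (String dir : childDirs) {
            calls += recursiveCalls(keys, dir);
        }
        return calls;
    }

    /** Flat strategy: page through every key under the prefix, ignoring directory boundaries. */
    static int flatCalls(TreeSet<String> keys, String prefix) {
        return pages(keys.subSet(prefix, true, prefix + Character.MAX_VALUE, false).size());
    }

    /** Builds illustrative keys like events_v3/event_date=0/event_hour=0/part0.parquet. */
    static TreeSet<String> buildKeys(int days, int hours, int filesPerHour) {
        TreeSet<String> keys = new TreeSet<>();
        for (int d = 0; d < days; d++) {
            for (int h = 0; h < hours; h++) {
                for (int f = 0; f < filesPerHour; f++) {
                    keys.add("events_v3/event_date=" + d + "/event_hour=" + h + "/part" + f + ".parquet");
                }
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = buildKeys(3, 2, 4);
        System.out.println("recursive: " + recursiveCalls(keys, "events_v3/"));
        System.out.println("flat:      " + flatCalls(keys, "events_v3/"));
    }
}
```

With 3 days, 2 hours per day and 4 files per hour, the recursive strategy in this model needs 10 calls (1 for the root, 3 for the days, 6 for the hours), while the flat strategy needs 1. The gap grows with the number of directories, not with the number of files.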

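For example, if we had the directory structure above with 1 year's worth of data, one directory per hour, and 10 sub-directories in each, we would make 365 * 24 * 10 = 87,600 (about 87k) API calls. Given that there are only 137,000 files, this can be reduced to 138 API calls, since each S3 API call returns 1000 files.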
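The arithmetic behind these figures can be checked with a short sketch. `ListingCallEstimate` is a hypothetical name used only for this example, and the inputs are the numbers quoted in the text. Note that a plain ceiling division of 137,000 files by 1000 per page gives 137; the 138 quoted above presumably counts one additional request, but that is an assumption.

```java
/** Back-of-the-envelope arithmetic; the inputs are the figures quoted in the text. */
final class ListingCallEstimate {

    /** One listing call per leaf directory: days * hours * sub-directories. */
    static long perDirectoryCalls(long days, long hoursPerDay, long subDirsPerHour) {
        return days * hoursPerDay * subDirsPerHour;
    }

    /** Paged prefix listing: ceil(files / pageSize) calls. */
    static long pagedCalls(long files, long pageSize) {
        return (files + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        System.out.println(perDirectoryCalls(365, 24, 10)); // 87600
        System.out.println(pagedCalls(137_000, 1_000));     // 137
    }
}
```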

Code:

org/apache/hadoop/fs/s3a/S3AFileSystem.java

public FileStatus[] listStatusRecursively(Path f) throws FileNotFoundException,
    IOException {
  String key = pathToKey(f);
  if (LOG.isDebugEnabled()) {
    LOG.debug("List status for path: " + f);
  }

  final List<FileStatus> result = new ArrayList<FileStatus>();
  final FileStatus fileStatus = getFileStatus(f);

  if (fileStatus.isDirectory()) {
    if (!key.isEmpty()) {
      key = key + "/";
    }

    // No delimiter is set, so the listing returns every key under the prefix.
    ListObjectsRequest request = new ListObjectsRequest();
    request.setBucketName(bucket);
    request.setPrefix(key);
    request.setMaxKeys(maxKeys);

    if (LOG.isDebugEnabled()) {
      LOG.debug("listStatus: doing listObjects for directory " + key);
    }

    ObjectListing objects = s3.listObjects(request);
    statistics.incrementReadOps(1);

    while (true) {
      for (S3ObjectSummary summary : objects.getObjectSummaries()) {
        Path keyPath = keyToPath(summary.getKey()).makeQualified(uri, workingDir);
        // Skip over keys that are ourselves and old S3N _$folder$ files
        if (keyPath.equals(f) || summary.getKey().endsWith(S3N_FOLDER_SUFFIX)) {
          if (LOG.isDebugEnabled()) {
            LOG.debug("Ignoring: " + keyPath);
          }
          continue;
        }

        if (objectRepresentsDirectory(summary.getKey(), summary.getSize())) {
          result.add(new S3AFileStatus(true, true, keyPath));
          if (LOG.isDebugEnabled()) {
            LOG.debug("Adding: fd: " + keyPath);
          }
        } else {
          result.add(new S3AFileStatus(summary.getSize(),
              dateToLong(summary.getLastModified()), keyPath,
              getDefaultBlockSize(f.makeQualified(uri, workingDir))));
          if (LOG.isDebugEnabled()) {
            LOG.debug("Adding: fi: " + keyPath);
          }
        }
      }

      for (String prefix : objects.getCommonPrefixes()) {
        Path keyPath = keyToPath(prefix).makeQualified(uri, workingDir);
        if (keyPath.equals(f)) {
          continue;
        }
        result.add(new S3AFileStatus(true, false, keyPath));
        if (LOG.isDebugEnabled()) {
          LOG.debug("Adding: rd: " + keyPath);
        }
      }

      if (objects.isTruncated()) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("listStatus: list truncated - getting next batch");
        }
        objects = s3.listNextBatchOfObjects(objects);
        statistics.incrementReadOps(1);
      } else {
        break;
      }
    }
  } else {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Adding: rd (not a dir): " + f);
    }
    result.add(fileStatus);
  }

  return result.toArray(new FileStatus[result.size()]);
}

/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala

def listLeafFiles(fs: FileSystem, status: FileStatus, filter: PathFilter): Array[FileStatus] = {
  logTrace(s"Listing ${status.getPath}")
  val name = status.getPath.getName.toLowerCase
  if (shouldFilterOut(name)) {
    Array.empty[FileStatus]
  }
  else {
    val statuses = {
      val stats = if (fs.isInstanceOf[S3AFileSystem]) {
        logWarning("Using Monkey patched version of list status")
        println("Using Monkey patched version of list status")
        val a = fs.asInstanceOf[S3AFileSystem].listStatusRecursively(status.getPath)
        a
        // Array.empty[FileStatus]
      }
      else {
        val (dirs, files) = fs.listStatus(status.getPath).partit