Apache Spark: How to see pushed filters and partition filters in Spark 3


How can I see the partition filters and pushed filters in Spark 3 (3.0.0-preview2)?

In Spark 2, the explain method output showed these details:

== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct
This makes it easy to identify the PartitionFilters and PushedFilters.
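For context, a query along these lines produces that Spark 2 plan (a sketch, since the original code isn't shown here; the path and column names are taken from the plan above):

```scala
// Sketch of the Spark 2 query behind the plan above; assumes an existing
// SparkSession named spark and the people.csv file from the Location line.
import org.apache.spark.sql.functions.col

val df = spark.read
  .option("header", "true")
  .csv("/Users/powers/Documents/tmp/blog_data/people.csv")

df.filter(col("country") === "Russia")
  .filter(col("first_name").startsWith("M"))
  .select("first_name", "last_name", "country")
  .explain()
```

Both equality filters and the startsWith condition are eligible for pushdown into the CSV scan, which is why they appear under PushedFilters.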

In Spark 3, the explain output is much less detailed, even with the extended argument:

import org.apache.spark.sql.functions.col

val path = new java.io.File("./src/test/resources/person_data.csv").getCanonicalPath
val df = spark.read.option("header", "true").csv(path)
df
  .filter(col("person_country") === "Cuba")
  .explain("extended")
Here is the output:

== Parsed Logical Plan ==
'Filter ('person_country = Cuba)
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv

== Analyzed Logical Plan ==
person_name: string, person_country: string
Filter (person_country#116 = Cuba)
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv

== Optimized Logical Plan ==
Filter (isnotnull(person_country#116) AND (person_country#116 = Cuba))
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv

== Physical Plan ==
*(1) Project [person_name#115, person_country#116]
+- *(1) Filter (isnotnull(person_country#116) AND (person_country#116 = Cuba))
   +- BatchScan[person_name#115, person_country#116] CSVScan Location: InMemoryFileIndex[file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/re..., ReadSchema: struct<person_name:string,person_country:string>

Is there a way to see the partition filters and pushed filters in Spark 3?

This appears to be a bug that was fixed at the end of April. There is a JIRA for predicate pushdown and another for partition pushdown.

Could you check whether your Spark version includes this fix?
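One quick way to check which build you are actually running (assuming an active session named spark):

```scala
// Prints the exact version string of the running Spark session,
// e.g. "3.0.0-preview2".
println(spark.version)
```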

I'm using 3.0.0-preview2, which is the latest available version; it was released on December 23, 2019. These fixes were made after that, and I believe a release containing them will come out at the end of May, at which point this issue should be resolved.
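For what it's worth, once you are on a build that includes the fixes, Spark 3 also adds a formatted explain mode that breaks each operator out with a details section; the scan node's details should then surface the pushed filters again (a sketch; the exact output shape may differ between versions):

```scala
// Sketch: Spark 3's Dataset.explain accepts a mode string
// ("simple", "extended", "codegen", "cost", "formatted").
// In formatted mode, the scan node's details list PushedFilters.
df.filter(col("person_country") === "Cuba")
  .explain("formatted")
```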