Scala example of a DataFrame that is not broadcast on join, without changing Spark properties


According to the documentation, a small DataFrame is broadcast by default if it is a "Hive Metastore table" and its size is below 10MB.

How can I create a table in a local spark shell for which no statistics have been computed yet?

So far I have tried spark.read.csv, Seq(("SOF")).toDF("name"), and spark.range(1000). All of the resulting DataFrames led to broadcast joins.

How can I, in the spark shell on a local Spark cluster, build a "small" DataFrame (under 10MB in size) that is not broadcast, without changing the property spark.sql.autoBroadcastJoinThreshold to -1? This is what I have so far:

scala> val df1 = spark.range(10)
df1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val df2 = spark.range(100)
df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df1.join(df2, Seq("Id")).explain
== Physical Plan ==
*(2) Project [id#232L]
+- *(2) BroadcastHashJoin [id#232L], [id#234L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :  +- *(1) Range (0, 10, step=1, splits=8)
   +- *(2) Range (0, 100, step=1, splits=8)

scala> val df1 = spark.range(10000000)
df1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df1.join(df2, Seq("Id")).explain
== Physical Plan ==
*(2) Project [id#238L]
+- *(2) BroadcastHashJoin [id#238L], [id#234L], Inner, BuildRight
   :- *(2) Range (0, 10000000, step=1, splits=8)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *(1) Range (0, 100, step=1, splits=8)

scala> val df2 = spark.range(10000000)
df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df1.join(df2, Seq("Id")).explain
== Physical Plan ==
*(5) Project [id#238L]
+- *(5) SortMergeJoin [id#238L], [id#242L], Inner
   :- *(2) Sort [id#238L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#238L, 200)
   :     +- *(1) Range (0, 10000000, step=1, splits=8)
   +- *(4) Sort [id#242L ASC NULLS FIRST], false, 0
      +- ReusedExchange [id#242L], Exchange hashpartitioning(id#238L, 200)

scala>
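As an aside: if the goal is only to see a SortMergeJoin for small inputs without touching spark.sql.autoBroadcastJoinThreshold globally, newer Spark versions offer per-join strategy hints. A minimal sketch, assuming a running SparkSession named spark and Spark 3.0 or later (where the "merge" hint was introduced):

```scala
// Force a sort-merge join for this one join only (Spark 3.0+),
// leaving spark.sql.autoBroadcastJoinThreshold at its default.
val small1 = spark.range(10)
val small2 = spark.range(100)

// The "merge" hint asks the planner to use SortMergeJoin even though
// both sides are far below the 10MB broadcast threshold.
small1.join(small2.hint("merge"), Seq("id")).explain()
```

This changes the plan for a single join rather than for the whole session, which is often preferable to flipping the configuration back and forth.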


Thanks for your answer! Unfortunately, I was thinking more of a small DataFrame that does not get broadcast. I suspect spark.range(10000000) weighs more than 10MB. That is not what I was after; I will clarify my question.

Thank you. If a DataFrame fits within the 10MB set by autoBroadcastJoinThreshold, won't it always be broadcast? I think the only way to force a SortMergeJoin is to disable that value.

You can set spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") first, run your code, and then set it back to the default value.

Apparently the DataFrame needs to satisfy two conditions: being under 10MB, and "Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run." I don't fully understand that, which is why I am looking for an example where this statement does not hold and a DataFrame under 10MB is not broadcast. :)
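The toggle-and-restore approach from the comments can be sketched as follows. This is a config fragment, assuming a SparkSession named spark and two DataFrames df1 and df2; reading the current value first avoids hard-coding the default (normally 10485760, i.e. 10MB):

```scala
// Remember the current threshold, disable broadcasting, run the join,
// then restore the original value.
val key = "spark.sql.autoBroadcastJoinThreshold"
val previous = spark.conf.get(key)

spark.conf.set(key, "-1")          // -1 disables broadcast joins entirely
df1.join(df2, Seq("id")).explain() // now planned as a SortMergeJoin

spark.conf.set(key, previous)      // restore the original threshold
```

Restoring the saved value rather than assuming the default keeps the snippet safe in sessions where the threshold was already customized.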