Scala: join two Datasets with predicate pushdown
I have a Dataset created from an RDD and I am trying to join it with another Dataset created from a Phoenix table. When I execute the join, it looks like the entire database table is loaded to perform it. Is there a way to do such a join so that the filtering happens in the database rather than in Spark?

Also: dfToJoin is smaller than the table; I don't know whether that matters.

Edit: basically I want to join my Phoenix table with a Dataset created through Spark, without fetching the whole table into the executors.

Edit 2: here is the physical plan:
*Project [FEATURE#21, SEQUENCE_IDENTIFIER#22, TAX_NUMBER#23,
WINDOW_NUMBER#24, uniqueIdentifier#5, readLength#6]
+- *SortMergeJoin [FEATURE#21], [feature#4], Inner
:- *Sort [FEATURE#21 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(FEATURE#21, 200)
: +- *Filter isnotnull(FEATURE#21)
: +- *Scan PhoenixRelation(FEATURES,localhost,false)
[FEATURE#21,SEQUENCE_IDENTIFIER#22,TAX_NUMBER#23,WINDOW_NUMBER#24]
PushedFilters: [IsNotNull(FEATURE)], ReadSchema:
struct<FEATURE:int,SEQUENCE_IDENTIFIER:string,TAX_NUMBER:int,
WINDOW_NUMBER:int>
+- *Sort [feature#4 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(feature#4, 200)
+- *Filter isnotnull(feature#4)
+- *SerializeFromObject [assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).feature AS feature#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).uniqueIdentifier, true) AS uniqueIdentifier#5, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).readLength AS readLength#6]
+- Scan ExternalRDDScan[obj#3]
As you can see, the equals filter is not contained in the PushedFilters list, so it is apparent that no predicate pushdown is happening.
Spark will fetch the Phoenix table records to the corresponding executors (not the entire table to a single executor). Since there is no direct filter on the Phoenix table's DataFrame, we only see *Filter isnotnull(FEATURE#21) in the physical plan.

As you mentioned, its data is smaller when you apply a filter to the Phoenix table. You can push a filter on the feature column down to the Phoenix table by looking up the feature_id values in the other Dataset:
//This spread across workers - fully distributed
val dfToJoin = sparkSession.createDataset(rddToJoin)

//This sits in driver - not distributed
val list_of_feature_ids = dfToJoin.dropDuplicates("feature")
  .select("feature")
  .map(r => r.getString(0))
  .collect
  .toList

//This spread across workers - fully distributed
val tableDf = sparkSession
  .read
  .option("table", "table")
  .option("zkURL", "localhost")
  .format("org.apache.phoenix.spark")
  .load()
  .filter($"FEATURE".isin(list_of_feature_ids:_*)) //added filter

//This spread across workers - fully distributed
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")
joinedDf.explain()
What does joinedDf.explain(true) show? Does Spark support doing this in one step?
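Whether Spark can do this in one step depends on the data source: the Catalyst optimizer only pushes down filters it can express over a single relation, and a join key coming from another Dataset is not such a filter. The collect-then-isin trick turns the join keys into a literal In predicate that the source relation can evaluate. The following is a minimal, self-contained sketch of that idea in plain Scala (no Spark); remoteTable, fetchAll, and fetchWhereFeatureIn are hypothetical stand-ins for the Phoenix source, used only to show why pushdown reduces the rows fetched:

```scala
object PushdownSketch {
  // A toy "remote table": (feature, taxNumber) rows we pretend live in Phoenix.
  val remoteTable: Seq[(Int, Int)] =
    Seq((1, 100), (2, 200), (3, 300), (4, 400), (5, 500))

  // No pushdown: fetch every row, then filter locally (what the original plan did).
  def fetchAll(): Seq[(Int, Int)] = remoteTable

  // Pushdown: the IsIn predicate travels to the source, so only matching rows are fetched.
  def fetchWhereFeatureIn(keys: Set[Int]): Seq[(Int, Int)] =
    remoteTable.filter { case (feature, _) => keys.contains(feature) }

  def main(args: Array[String]): Unit = {
    // The "feature" values collected to the driver from dfToJoin.
    val joinKeys = Set(2, 4)

    val withoutPushdown = fetchAll().filter { case (f, _) => joinKeys.contains(f) }
    val withPushdown    = fetchWhereFeatureIn(joinKeys)

    // Same result either way, but pushdown transfers 2 rows instead of 5.
    println(s"rows fetched without pushdown: ${fetchAll().size}")
    println(s"rows fetched with pushdown:    ${withPushdown.size}")
    assert(withoutPushdown == withPushdown)
  }
}
```

The same trade-off applies in the real snippet above: the collect to the driver is extra work, but it shrinks the scan on the Phoenix side from the whole table to only the matching feature values.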