Scala: joining two Datasets with predicate pushdown

Tags: scala, apache-spark, hbase, apache-spark-sql, phoenix

I have a Dataset created from an RDD and am trying to join it with another Dataset created from a Phoenix table:
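In outline, the setup looks roughly like this (a minimal sketch; QueryFeature's field types, rddToJoin, and the table/zkURL values are assumptions inferred from the plan and answer below):

// Shape of the RDD rows, matching the fields that appear in the physical plan below
case class QueryFeature(feature: Int, uniqueIdentifier: String, readLength: Int)

import sparkSession.implicits._  // encoders for case classes and $-columns

// Dataset built from an existing RDD[QueryFeature]
val dfToJoin = sparkSession.createDataset(rddToJoin)

// Dataset backed by the Phoenix table
val tableDf = sparkSession
  .read
  .option("table", "FEATURES")
  .option("zkURL", "localhost")
  .format("org.apache.phoenix.spark")
  .load()

// Join on the shared feature column; the goal is to have the matching features
// filtered on the Phoenix/HBase side instead of scanning the whole table
val joinedDf = dfToJoin.join(tableDf, "feature")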

When I execute it, it looks as though the entire database table is loaded in order to do the join.

Is there a way to do such a join so that the filtering happens in the database rather than in Spark?

Also: dfToJoin is smaller than the table; I don't know whether that matters.

Edit: basically, I want to join my Phoenix table with a Dataset created through Spark, without fetching the whole table into the executors.

Edit 2: here is the actual plan:

*Project [FEATURE#21, SEQUENCE_IDENTIFIER#22, TAX_NUMBER#23, WINDOW_NUMBER#24, uniqueIdentifier#5, readLength#6]
+- *SortMergeJoin [FEATURE#21], [feature#4], Inner
   :- *Sort [FEATURE#21 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(FEATURE#21, 200)
   :     +- *Filter isnotnull(FEATURE#21)
   :        +- *Scan PhoenixRelation(FEATURES,localhost,false) [FEATURE#21,SEQUENCE_IDENTIFIER#22,TAX_NUMBER#23,WINDOW_NUMBER#24]
              PushedFilters: [IsNotNull(FEATURE)],
              ReadSchema: struct<FEATURE:int,SEQUENCE_IDENTIFIER:string,TAX_NUMBER:int,WINDOW_NUMBER:int>
   +- *Sort [feature#4 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(feature#4, 200)
         +- *Filter isnotnull(feature#4)
            +- *SerializeFromObject [assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).feature AS feature#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).uniqueIdentifier, true) AS uniqueIdentifier#5, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).readLength AS readLength#6]
               +- Scan ExternalRDDScan[obj#3]
As you can see, the equality filter is not contained in the PushedFilters list, so clearly no predicate pushdown is happening.

Spark fetches the Phoenix table records into the corresponding executors (it does not pull the whole table into a single executor).

Since there is no direct filter on the Phoenix table DataFrame, the physical plan only shows *Filter isnotnull(FEATURE#21).
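For comparison, a literal predicate applied directly to the Phoenix DataFrame does get pushed down (a small sketch; the value 42 is made up):

// Hypothetical literal filter applied straight to the Phoenix-backed DataFrame
val directlyFiltered = tableDf.filter($"FEATURE" === 42)

directlyFiltered.explain()
// The scan node should now report something like
//   PushedFilters: [IsNotNull(FEATURE), EqualTo(FEATURE,42)]
// i.e. the predicate is evaluated by the data source instead of inside Spark.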


As you mentioned, the Phoenix table yields less data once a filter is applied to it. You can push a filter on the feature column down to the Phoenix table by first collecting the feature ids from the other dataset:

// This is spread across the workers - fully distributed
val dfToJoin = sparkSession.createDataset(rddToJoin)

// This sits in the driver - not distributed
val list_of_feature_ids = dfToJoin.dropDuplicates("feature")
  .select("feature")
  .map(r => r.getInt(0))  // FEATURE is an int per the ReadSchema above
  .collect
  .toList

// This is spread across the workers - fully distributed
val tableDf = sparkSession
  .read
  .option("table", "table")
  .option("zkURL", "localhost")
  .format("org.apache.phoenix.spark")
  .load()
  .filter($"FEATURE".isin(list_of_feature_ids:_*)) // added filter

// This is spread across the workers - fully distributed
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")

joinedDf.explain()
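With the isin filter in place, the Phoenix scan side of the new plan should reflect the pushdown (a sketch of what to look for; the exact rendering depends on the Spark and Phoenix connector versions):

// Expected change in the explain() output above: the PhoenixRelation scan line
// should now list the collected ids, e.g.
//   PushedFilters: [IsNotNull(FEATURE), In(FEATURE, [<collected ids>])]
// meaning the feature lookup runs in Phoenix/HBase instead of after a full table scan.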

What does joinedDf.explain(true) show? Does Spark support doing this in a single step?