Apache Spark: joining multiple Datasets in Apache Spark with joinWith


I want to join three Datasets with joinWith and then process them in a nice way, something like case ((t1, t2), t3) =>, but it fails with an exception. The reason for the error is clear: the result of joining two Datasets this way looks like this:

+-------------+----------+
|           _1|        _2|
+-------------+----------+
|[1, Name1, 1]|[1, Dept1]|
|[2, Name2, 2]|[2, Dept2]|
|[3, Name3, 3]|[3, Dept3]|
+-------------+----------+
So I cannot join the resulting table with the next one. Is there perhaps another way? Is it even possible to join multiple tables in a "type-safe" manner (with joinWith)?

The idea:

import org.apache.spark.sql.SparkSession

object MainApp {
  case class Emp(empId: Int, name: String, deptId: Int)
  case class Dept(deptId: Int, name: String)
  case class Addr(empId: Int, name: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("SparkTest")
      .getOrCreate()
    import spark.implicits._

    val emps = Seq(
      (1, "Name1", 1),
      (2, "Name2", 2),
      (3, "Name3", 3)
    ).toDF("empId", "name", "deptId").as[Emp]

    val depts = Seq(
      (1, "Dept1"),
      (2, "Dept2"),
      (3, "Dept3")
    ).toDF("deptId", "name").as[Dept]

    val addrs = Seq(
      (1, "Addr1"),
      (2, "Addr2"),
      (3, "Addr3")
    ).toDF("empId", "name").as[Addr]

    val result = emps
      .joinWith(depts, emps("deptId") === depts("deptId"), "inner")
      .joinWith(addrs, emps("empId") === addrs("empId"), "inner")

//    result.map {
//      case ((emp, dept), addr) => ???
//    }
  }
}
The exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) empId#7 missing from _1#41,_2#42,empId#34,name#35 in operator !Join Inner, (empId#7 = empId#34). Attribute(s) with the same name appear in the operation: empId. Please check if the right attribute(s) are used.;;
!Join Inner, (empId#7 = empId#34)
:- Join Inner, (_1#41.deptId = _2#42.deptId)
:  :- Project [named_struct(empId, empId#7, name, name#8, deptId, deptId#9) AS _1#41]
:  :  +- Project [_1#3 AS empId#7, _2#4 AS name#8, _3#5 AS deptId#9]
:  :     +- LocalRelation [_1#3, _2#4, _3#5]
:  +- Project [named_struct(deptId, deptId#22, name, name#23) AS _2#42]
:     +- Project [_1#19 AS deptId#22, _2#20 AS name#23]
:        +- LocalRelation [_1#19, _2#20]
+- Project [_1#31 AS empId#34, _2#32 AS name#35]
   +- LocalRelation [_1#31, _2#32]

    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:326)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.Dataset.joinWith(Dataset.scala:1079)
    at sample.sample.MainApp$.main(MainApp.scala:41)
    at sample.sample.MainApp.main(MainApp.scala)
You need to do it in two steps, using a map with a case at the start. The new Dataset API only goes so far in this respect; the old one could do n-way joins.
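For comparison, here is a minimal sketch of such an n-way join with the untyped API, reusing the emps, depts and addrs Datasets from the question (empDF, deptDF and addrDF are only illustrative names). Each join returns a flat DataFrame, so the next condition can still refer to columns directly:

// Untyped variant for comparison: drop the typing, then chain joins.
val empDF  = emps.toDF()
val deptDF = depts.toDF()
val addrDF = addrs.toDF()

val flat = empDF
  .join(deptDF, empDF("deptId") === deptDF("deptId"), "inner")
  .join(addrDF, empDF("empId") === addrDF("empId"), "inner")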

Specifically, in contrast to a DF, a DS using joinWith returns a tuple of the two classes from the left and right Datasets. The function is defined as:

joinWith[U](other: Dataset[U],
            condition: Column,
            joinType: String): Dataset[(T, U)]

Clearly, as your output shows, it does not behave like the join API does.
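To make the difference concrete, here is a short sketch of the types each call produces for the Datasets above (the value names typed and untyped are just for illustration):

import org.apache.spark.sql.{DataFrame, Dataset}

// joinWith keeps each side as a whole object inside a tuple...
val typed: Dataset[(Emp, Dept)] =
  emps.joinWith(depts, emps("deptId") === depts("deptId"), "inner")

// ...while join flattens the columns of both sides into one row, which is
// why chained joins with plain column references keep working.
val untyped: DataFrame =
  emps.join(depts, emps("deptId") === depts("deptId"), "inner")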

It does seem possible to chain the joins with joinWith, but it is not very elegant. Because the first join produces a tuple, you need some way to refer into it, and that can be done with $"_1.empId":

val result = emps
  .joinWith(depts, emps("deptId") === depts("deptId"), "inner")
  .joinWith(addrs, $"_1.empId" === addrs("empId"), "inner")
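The nested rows can then be unpacked with a pattern-matching map, for example (the projected fields are only an illustration; spark.implicits._ must be in scope for the tuple encoder):

// Each row of result is ((Emp, Dept), Addr); destructure it in map.
val names = result.map {
  case ((emp, dept), addr) => (emp.name, dept.name, addr.name)
}
names.show()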

To avoid creating the tuple in the first place, you would have to name the dataset first and then perform the next join.
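A minimal sketch of that approach, reusing the case classes from the question; EmpDept is a hypothetical helper class that gives the first join's rows a name:

// Hypothetical helper, declared next to the other case classes.
case class EmpDept(empId: Int, empName: String, deptName: String)

val empDepts = emps
  .joinWith(depts, emps("deptId") === depts("deptId"), "inner")
  .map { case (emp, dept) => EmpDept(emp.empId, emp.name, dept.name) }

// No tuple left, so the next joinWith can resolve empDepts("empId").
val full = empDepts
  .joinWith(addrs, empDepts("empId") === addrs("empId"), "inner")

The second joinWith then yields a Dataset[(EmpDept, Addr)], which is flat enough to pattern match on directly.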

What is your expected output? Could you accept the answer?