Scala: Retrieve the value of the second field of an existing RDD based on the value of the first field of another RDD


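I have data in three files in HDFS, as follows:

EmployeeManagers.txt (EmpID,ManagerID)

EmployeeNames.txt (EmpID,Name)

EmployeeSalary.txt (EmpID,Salary)

I want to create RDDs from these files and print the data in the format: ID, employee name, salary, manager name.

I have joined the three RDDs on their key, which is the first column of each text file. I can print the manager ID, but not the manager name.

Here is the code I wrote: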

// Each file is parsed into a pair RDD keyed by EmpID (first column).
val manager = sc.textFile("EmployeeManagers")
val managerPairRDD = manager.map(x => (x.split(",")(0), x.split(",")(1)))
val name = sc.textFile("EmployeeNames")
val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))
val salary = sc.textFile("EmployeeSalary")
val salaryPairRDD = salary.map(x => (x.split(",")(0), x.split(",")(1)))
// Join on EmpID: (EmpID, ((Name, Salary), ManagerID))
val data = namePairRDD.join(salaryPairRDD).join(managerPairRDD)

The current output looks like this:

scala> data.collect();
res4: Array[(String, ((String, String), String))] = Array((4,((Krinton Kale,4000),6)), (5,((Harry Donal,5000),6)), (2,((Jimmy Kent,2000),4)), (3,((Shannon Witt,3000),4)), (1,((Ronald Rays,1000),5)))

You then have to join on namePairRDD once more, this time using the manager ID as the key:

val result = namePairRDD
  .join(salaryPairRDD)
  .join(managerPairRDD)
  .map { case (id, ((name, salary), mngrId)) => (mngrId, (id, name, salary)) }
  .join(namePairRDD) // join again, this time on managerId
  .map { case (_, ((id, name, salary), mngrName)) => (id, name, salary, mngrName) }

result.foreach(println)
// (2,Jimmy Kent,2000.0,Krinton Kale)
// (3,Shannon Witt,3000.0,Krinton Kale)
// (1,Ronald Rays,1000.0,Harry Donal)
// (4,Krinton Kale,4000.0,Christina Fernandez)
// (5,Harry Donal,5000.0,Christina Fernandez)
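
As an alternative to the second join, when the names file is small enough to fit in memory, the manager name can be looked up in a broadcast map instead. The following is only a minimal sketch under that assumption, not the method used in the answer above: the object name ManagerNameLookup, the helper withManagerNames, the "unknown" fallback and the local[*] master are hypothetical, and the sample rows are taken from the output shown in the question (employee 6 has no salary row here, so the inner join drops that employee).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ManagerNameLookup {

  // Hypothetical helper: resolves each manager ID to a name through a
  // broadcast map rather than a second join. Assumes `names` is small
  // enough to be collected to the driver.
  def withManagerNames(
      sc: SparkContext,
      names: RDD[(String, String)],
      salaries: RDD[(String, String)],
      managers: RDD[(String, String)]
  ): RDD[(String, String, String, String)] = {
    val namesById = sc.broadcast(names.collectAsMap())
    names
      .join(salaries)
      .join(managers)
      .map { case (id, ((name, salary), mngrId)) =>
        // "unknown" is an assumed placeholder for manager IDs with no name row
        (id, name, salary, namesById.value.getOrElse(mngrId, "unknown"))
      }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ManagerNameLookup")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // In-memory stand-ins for the three HDFS files
    val names = sc.parallelize(Seq(
      "1" -> "Ronald Rays", "2" -> "Jimmy Kent", "3" -> "Shannon Witt",
      "4" -> "Krinton Kale", "5" -> "Harry Donal", "6" -> "Christina Fernandez"))
    val salaries = sc.parallelize(Seq(
      "1" -> "1000", "2" -> "2000", "3" -> "3000", "4" -> "4000", "5" -> "5000"))
    val managers = sc.parallelize(Seq(
      "1" -> "5", "2" -> "4", "3" -> "4", "4" -> "6", "5" -> "6"))

    // Collect to the driver before printing so the output is ordered by ID
    withManagerNames(sc, names, salaries, managers)
      .collect()
      .sortBy(_._1)
      .foreach(println)

    spark.stop()
  }
}
```

The trade-off is that collectAsMap pulls every name onto the driver, so for a large names file the join-based version above is the safer choice.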

Thank you very much. I didn't realize "case" could be used this way.