Scala / Apache Spark: joining a DataFrame to itself via a relation DataFrame produces an empty result
I have run into a strange problem with Apache Spark (using the Scala API). There are two DataFrame objects; let's call them beans and relation.
The beans DataFrame consists of two columns, named id and data. Consider all ids to be unique, while data holds either a textual representation of some action or the target of some action.
The relation DataFrame defines the relations between actions and their targets. It consists of two columns: actionId and targetId.
(See the code snippet below for a table representation of the DataFrame objects.)
Basically, I am trying to alias beans as two new DataFrame objects, actions and targets, and then join them via the relation DataFrame. However, the join produces an empty result:
+--------+--------+------+------+
|actionId|targetId|action|target|
+--------+--------+------+------+
+--------+--------+------+------+
Here is some code to illustrate what is happening:
// define the SQL context
val sqlContext = new SQLContext(sparkContext)
// ...
// Produce the following DataFrame objects:
// beans: relation:
// +--------+--------+ +----------+----------+
// | id | data | | actionId | targetId |
// +--------+--------+ +----------+----------+
// | a | save | | a | 1 |
// +--------+--------+ +----------+----------+
// | b | delete | | b | 2 |
// +--------+--------+ +----------+----------+
// | c | read | | c | 3 |
// +--------+--------+ +----------+----------+
// | 1 | file |
// +--------+--------+
// | 2 | os |
// +--------+--------+
// | 3 | book |
// +--------+--------+
case class Bean(id: String, data: String)
case class Relation(actionId: String, targetId: String)
val beans = sqlContext.createDataFrame(
Bean("a", "save") :: Bean("b", "delete") :: Bean("c", "read") ::
Bean("1", "file") :: Bean("2", "os") :: Bean("3", "book") :: Nil
)
val relation = sqlContext.createDataFrame(
Relation("a", "1") :: Relation("b", "2") :: Relation("c", "3") :: Nil
)
// alias beans as "actions" and "targets" to avoid ambiguity
val actions = beans as "actions"
val targets = beans as "targets"
// join actions and targets via relation
actions.join(relation, actions("id") === relation("actionId"))
.join(targets, targets("id") === relation("targetId"))
.select(actions("id") as "actionId", targets("id") as "targetId",
actions("data") as "action", targets("data") as "target")
.show()
The desired output of this snippet is:
// desired output
// +----------+----------+--------+--------+
// | actionId | targetId | action | target |
// +----------+----------+--------+--------+
// | a | 1 | save | file |
// +----------+----------+--------+--------+
// | b | 2 | delete | os |
// +----------+----------+--------+--------+
// | c | 3 | read | book |
// +----------+----------+--------+--------+
However, the actual (strange) output is an empty DataFrame:
+--------+--------+------+------+
|actionId|targetId|action|target|
+--------+--------+------+------+
+--------+--------+------+------+
I suspected that joining a DataFrame to itself was the problem, but the example in the answer I linked proved that suspicion wrong.
I am using Spark 1.4.1 with Scala 2.10.4, but I get the same result on Spark 1.5.1 with Scala 2.11.7.
Changing the schema of the DataFrame objects is not an option. Any suggestions?
Solution
For reference: if you get an error message like this
error: value $ is not a member of StringContext
actions.join(relation, $"actions.id" === $"actionId")
^
make sure to add the following import statement:
import sqlContext.implicits._
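In context, the import goes right after the SQLContext is created; here is a minimal sketch based on the snippets in this question (sparkContext, actions, and relation are assumed to be defined as above):

```scala
// create the SQL context as in the question
val sqlContext = new SQLContext(sparkContext)
// brings the $"..." string interpolator (among other helpers) into scope
import sqlContext.implicits._

// the $-based join condition now compiles
actions.join(relation, $"actions.id" === $"actionId")
```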
Solution
I would split this into two stages, like so:
val beans = sqlContext.createDataFrame(
Bean("a", "save") ::
Bean("b", "delete") ::
Bean("c", "read") ::
Bean("1", "file") ::
Bean("2", "os") ::
Bean("3", "book") ::
Nil
)
val relation = sqlContext.createDataFrame(
Relation("a", "1") ::
Relation("b", "2") ::
Relation("c", "3") ::
Nil
)
// "add" the action column
val step1 = beans.join(relation, beans("id") === relation("actionId"))
.select(
relation("actionId"),
relation("targetId"),
beans("data").as("action")
)
// "add" target column
val result = step1.join(beans, beans("id") === relation("targetId"))
.select(
step1("actionId"),
step1("targetId"),
step1("action"),
beans("data").as("target")
)
result.show
Comments
That said, putting different kinds of beans ("a", "b", "c" versus "1", "2", "3") in the same table seems unusual and smells bad. There is a subtle difference between what you are doing here and the example you linked. In the linked answer I use Column objects directly, whereas here you use the apply method on the DataFrames. To see the difference, just type both into the REPL:
scala> actions("actions.id")
res59: org.apache.spark.sql.Column = id
scala> col("actions.id")
res60: org.apache.spark.sql.Column = actions.id
For the aliases to be picked up correctly, you have to use Column objects directly; otherwise the alias is stripped away. That means you need a query like this:
actions.join(relation, $"actions.id" === $"actionId")
.join(targets, $"targets.id" === $"targetId")
or the equivalent using col, e.g.
actions.join(relation, col("actions.id") === col("actionId"))
    .join(targets, col("targets.id") === col("targetId"))
to make it work. Using col on the right-hand side here is strictly optional; you could keep using apply there as before. Alternatively, if you prefer to stick with apply, you can rename the join columns:
val targets = beans.withColumnRenamed("id", "_targetId")
val actions = beans.withColumnRenamed("id", "_actionId")
actions.join(relation, actions("_actionId") === relation("actionId"))
.join(targets, targets("_targetId") === relation("targetId"))
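To get the same four-column output as the original query, a final select can be appended to the renamed-column join. This is a sketch rather than part of the original answer; it simply restores the renamed _actionId/_targetId columns to their original names:

```scala
actions.join(relation, actions("_actionId") === relation("actionId"))
  .join(targets, targets("_targetId") === relation("targetId"))
  .select(
    actions("_actionId").as("actionId"),
    targets("_targetId").as("targetId"),
    actions("data").as("action"),
    targets("data").as("target")
  )
  .show()
```

Because the id columns now have distinct names on each side, the apply-style lookups resolve unambiguously and no alias handling is needed.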