Apache Spark: joining based on a column's value
I am using spark-sql-2.4.1v. How can I perform different joins depending on the value of a column? Sample data:
import spark.implicits._  // needed for toDF on a local collection

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("21", "rate", "school", 13, 12)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")
+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school|    14|    12|
| 21|score|school|    13|    13|
| 22| rate|school|    11|    14|
| 21| rate|school|    13|    12|
+---+-----+------+------+------+
Based on the value of the "code" column, I need to join with various other tables.
val data1 = List(
  ("22", 11, "A"),
  ("22", 14, "B"),
  ("20", 13, "C"),
  ("21", 12, "C"),
  ("21", 13, "D")
)
val rateDs = data1.toDF("id", "map_code", "map_val")
If the value of the "code" column is "rate", I need to join with rateDs.
If the value of the "code" column is "score", I need to join with scoreDs.
How can I handle this in Spark? Is there a recommended way to achieve it?
Expected result for the rate rows:
+---+----+------+------+------+
| id|code|entity|value1|value2|
+---+----+------+------+------+
| 22|rate|school|     A|     B|
| 21|rate|school|     D|     C|
+---+----+------+------+------+
For example, you can simply join twice:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val data = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("21", "rate", "school", 13, 12)
)
val df = data.toDF("id", "code", "entity", "value1", "value2")

val data1 = List(
  ("22", 11, "A"),
  ("22", 14, "B"),
  ("20", 13, "C"),
  ("21", 12, "C"),
  ("21", 13, "D")
)
val rateDF = data1.toDF("id", "map_code", "map_val")

df.as("a")
  .join(rateDF.as("b"),
    col("a.code") === lit("rate")
      && col("a.id") === col("b.id")
      && col("a.value1") === col("b.map_code"), "inner")
  .join(rateDF.as("c"),
    col("a.code") === lit("rate")
      && col("a.id") === col("c.id")
      && col("a.value2") === col("c.map_code"), "inner")
  .select(col("a.id"), col("a.code"), col("a.entity"),
    col("b.map_val").as("value1"), col("c.map_val").as("value2"))
  .show(false)
+---+----+------+------+------+
|id |code|entity|value1|value2|
+---+----+------+------+------+
|22 |rate|school|A |B |
|21 |rate|school|D |C |
+---+----+------+------+------+
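The logic of the double self-join above can be checked without a Spark cluster. The following is a plain-Python sketch (not Spark code) that mirrors the two join conditions, a.id === b.id && a.value1 === b.map_code and a.id === c.id && a.value2 === c.map_code, as dictionary lookups, using the same sample data:

```python
# Spark-free simulation of the answer's double self-join:
# each (id, valueN) pair in a "rate" row is looked up in the rate
# mapping table, replacing valueN with map_val, just like the two
# inner joins against rateDF aliased as "b" and "c".

df = [
    ("20", "score", "school", 14, 12),
    ("21", "score", "school", 13, 13),
    ("22", "rate", "school", 11, 14),
    ("21", "rate", "school", 13, 12),
]
rate = {("22", 11): "A", ("22", 14): "B", ("20", 13): "C",
        ("21", 12): "C", ("21", 13): "D"}

result = [
    (id_, code, entity, rate[(id_, v1)], rate[(id_, v2)])
    for (id_, code, entity, v1, v2) in df
    if code == "rate" and (id_, v1) in rate and (id_, v2) in rate
]
print(result)
# [('22', 'rate', 'school', 'A', 'B'), ('21', 'rate', 'school', 'D', 'C')]
```

The two surviving rows match the expected output table: the "score" rows are dropped by the a.code === lit("rate") predicate, exactly as in the inner joins.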
Comments:

Hmm, this looks a bit dirty, but I don't know a cleaner way for multiple columns... You could filter the dataframe into two, join each subset with its own lookup table, and union them again.
@koiralo Thanks. Could a "when" clause be used instead?
I think this kind of join will hurt performance; it is not recommended when joining tables. What is the coalesce("b.value1", "c.value1") doing here??
When code = rate, b.value1 will be non-null and c.value1 will be null; when code = score it is the reverse. So coalesce collects the two results into a single column. But it is up to you, this is just an example.
After joining on the condition col("a.id") === col("b.id"), "left", I need to loop over multiple column values, i.e. "value_1" and "value_2" of "df" against "rateDs"... how can I handle that?