Scala: Adding missing categories to a column of a DataFrame

scala apache-spark

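I have the following Spark DataFrame. The country column has 10 distinct values. I want to obtain a new DataFrame like the one shown under the expected result.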

DataFrame
+-------------+--------------+------------------+
|         Code|       country|                t1|
+-------------+--------------+------------------+
|            A|        Canada| 6218.400000000001|
|            A|       Central|              30.4|
|            A|        France|24540.629999999965|
|            A|       Germany|27688.029999999966|
|            A|     Northeast|             51.41|
|            A|     Northwest| 56261.31000000015|
|            A|     Southeast|             55.71|
|            A|     Southwest| 92640.42999999833|
|            A|United Kingdom|              0.64|
|            B|     Australia|145856.31999999806|
|            C|        Canada| 28223.26999999983|
|            C|     Northwest|              0.87|
|            C|     Southwest|              0.44|
+-------------+--------------+------------------+

Distinct values of the country column:
+--------------+
|       country|
+--------------+
|     Australia|
|        Canada|
|       Central|
|        France|
|       Germany|
|     Northeast|
|     Northwest|
|     Southeast|
|     Southwest|
|United Kingdom|
+--------------+

Expected result:

+-------------+--------------+------------------+
|         Code|       country|                t1|
+-------------+--------------+------------------+
|            A|     Australia|              null|
|            A|        Canada| 6218.400000000001|
|            A|       Central|              30.4|
|            A|        France|24540.629999999965|
|            A|       Germany|27688.029999999966|
|            A|     Northeast|             51.41|
|            A|     Northwest| 56261.31000000015|
|            A|     Southeast|             55.71|
|            A|     Southwest| 92640.42999999833|
|            A|United Kingdom|              0.64|
|            B|     Australia|145856.31999999806|
|            B|        Canada|              null|
|            B|       Central|              null|
|            B|        France|              null|
|            B|       Germany|              null|
|            B|     Northeast|              null|
|            B|     Northwest|              null|
|            B|     Southeast|              null|
|            B|     Southwest|              null|
|            B|United Kingdom|              null|
|            C|     Australia|              null|
|            C|        Canada| 28223.26999999983|
|            C|       Central|              null|
|            C|        France|              null|
|            C|       Germany|              null|
|            C|     Northeast|              null|
|            C|     Northwest|              0.87|
|            C|     Southeast|              null|
|            C|     Southwest|              0.44|
|            C|United Kingdom|              null|
+-------------+--------------+------------------+

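How can I produce the expected output in Scala? I have gone through the Dataset functions/methods but could not find anything to get me started.

Note that there may be several such columns, and the logic is the same for each: I want to insert every missing category for all of them.

I am a beginner in Scala. Thanks in advance :)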

Cross join the distinct codes with the distinct countries, then left join the result to the original table. Something like:

// Assumes spark.implicits._ is imported so that $"..." resolves.
val codes = data.select($"Code").distinct
val countries = data.select($"country").distinct
// Every (Code, country) pair, whether or not it occurs in data.
val combinations = codes.crossJoin(countries)
// The left join keeps all pairs; t1 is null where data has no matching row.
val result = combinations.join(data, Seq("Code", "country"), "left_outer")
  .orderBy($"Code", $"country")

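df.select($"country").distinct ?

That will only return the distinct values of the country column from the DataFrame.

What are the input and the expected output?

Please refer to the question; I covered both in the description. I only used two columns as an example, but there may be several columns, and in that case I don't think this will work.

That is exactly what you asked in the question; if your question is different, you should ask... In any case, crossJoin gives you all combinations of any columns, so if you have several such columns you should probably cross join them as well before the final left join.

Edit the question. My answer still holds: you need to cross join all the distinct values and then left join to the actual data (t1 in your example).

Yes, but since the data is large, this will hurt performance. Is there another way?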
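The suggestion in the comments (cross join the distinct values of every category column, then left join back to the data) can be sketched as a small generic helper. This is only a sketch under assumptions: the object name `FillMissingCategories`, the helper `fillMissing`, and the sample rows are made up for illustration, and the `broadcast` hint assumes each set of distinct values is small enough to fit in memory on every executor. Whether it actually helps on a large dataset would have to be measured.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object FillMissingCategories {
  // Hypothetical helper: expands `data` so that every combination of the
  // distinct values of `keyCols` appears; non-key columns are null where
  // the combination does not occur in `data`.
  def fillMissing(data: DataFrame, keyCols: Seq[String]): DataFrame = {
    require(keyCols.nonEmpty, "need at least one key column")
    // One single-column DataFrame of distinct values per key column.
    val distincts = keyCols.map(c => data.select(c).distinct())
    // Cross join them all. broadcast is only a hint, and it assumes the
    // distinct-value sets are small.
    val combinations = distincts.reduce((l, r) => l.crossJoin(broadcast(r)))
    combinations
      .join(data, keyCols, "left_outer")
      .orderBy(keyCols.head, keyCols.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-missing-categories")
      .getOrCreate()
    import spark.implicits._

    // A few rows in the shape of the question's data (values shortened).
    val data = Seq(
      ("A", "Canada", 6218.4),
      ("B", "Australia", 145856.32),
      ("C", "Canada", 28223.27)
    ).toDF("Code", "country", "t1")

    fillMissing(data, Seq("Code", "country")).show()
    spark.stop()
  }
}
```

With more category columns, only the `keyCols` argument changes, for example `fillMissing(data, Seq("Code", "country", "region"))`, where `region` is a made-up column name. Note that the number of combinations is the product of the distinct counts, so the result can grow quickly as columns are added.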