Spark DataFrame: ignore columns in groupBy whose ID is null


I have a DataFrame with, for example, the following structure:

ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1  | 123  | 1     |       |       | A1   | B1   |      | ... <- only P1_x columns filled
1  | 123  | 2     |       |       | A2   | B2   |      | ... <- only P1_x filled
1  | 123  | 3     |       |       | A3   | B3   |      | ... <- only P1_x filled
1  | 123  |       | 1     |       |      |      | A4   | ... <- only P2_x filled
1  | 123  |       | 2     |       |      |      | A5   | ... <- only P2_x filled
1  | 123  |       |       | 1     |      |      |      | ... <- only P3_x filled
I want to group these rows by ID and Date so that the rows sharing the same Px_ID value (e.g., the rows with P1_ID = 1, P2_ID = 1 and P3_ID = 1) are merged into a single row:

ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ...
============================================================
1  | 123  | 1     | 1     | 1     | A1   | B1   | A4   | ...
1  | 123  | 2     | 2     |       | A2   | B2   | A5   | ...
1  | 123  | 3     |       |       | A3   | B3   |      | ...

Is this possible, and how can it be done? Thanks in advance!

I found a solution to this problem: since the irrelevant Px_ID columns are null, one possible approach is to create a new column combined_ID that holds the concatenation of all Px_ID values (it will contain exactly one value, because only one Px_ID is non-null in each row):
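The original snippet is missing at this point; a minimal sketch of this step (assuming the ID columns are named P1_ID, P2_ID and P3_ID, as in the example, and that the DataFrame variable is myDF as used further down) could use concat_ws, which skips null arguments:

import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws ignores nulls, so with exactly one non-null Px_ID per row
// the concatenation is simply that single value (as a string)
myDF = myDF.withColumn("combined_ID",
  concat_ws("", col("P1_ID"), col("P2_ID"), col("P3_ID")))

Alternatively, coalesce(col("P1_ID"), col("P2_ID"), col("P3_ID")) would achieve the same thing while preserving the original column type.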

This changes the DataFrame to the following structure:

ID | Date | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ... | combined_ID 
===========================================================================
1  | 123  | 1     |       |       | A1   | B1   |      | ... | 1
1  | 123  | 2     |       |       | A2   | B2   |      | ... | 2
1  | 123  | 3     |       |       | A3   | B3   |      | ... | 3
1  | 123  |       | 1     |       |      |      | A4   | ... | 1
1  | 123  |       | 2     |       |      |      | A5   | ... | 2
1  | 123  |       |       | 1     |      |      |      | ... | 1
Now I can simply group the DataFrame by ID, Date and combined_ID, and aggregate all remaining columns with, for example, the max function to pick up the value of the single non-null cell in each group:

import org.apache.spark.sql.functions.max

val groupByColumns = Seq("ID", "Date", "combined_ID")
// max ignores nulls, so it returns the one non-null value per group;
// aliasing keeps the original column names instead of "max(...)"
val aggExprs = Seq("P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A" /* ... */).map(c => max(c).as(c))

myDF = myDF.groupBy(groupByColumns.head, groupByColumns.tail: _*)
  .agg(aggExprs.head, aggExprs.tail: _*)
Result:

ID | Date | combined_ID | P1_ID | P2_ID | P3_ID | P1_A | P1_B | P2_A | ... 
===========================================================================
1  | 123  | 1           | 1     | 1     | 1     | A1   | B1   | A4   | ...
1  | 123  | 2           | 2     | 2     |       | A2   | B2   | A5   | ...
1  | 123  | 3           | 3     |       |       | A3   | B3   |      | ...
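For completeness, here is a self-contained sketch that reproduces the result above. The sample rows, the local SparkSession setup, and the restriction to the P1/P2/P3 columns shown in the tables are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, max}

val spark = SparkSession.builder().master("local[*]").appName("combine-ids").getOrCreate()
import spark.implicits._

// Hypothetical sample data mirroring the tables above (columns beyond P2_A elided)
val input = Seq(
  (1, 123, Some(1), None,    None,    Some("A1"), Some("B1"), None),
  (1, 123, Some(2), None,    None,    Some("A2"), Some("B2"), None),
  (1, 123, Some(3), None,    None,    Some("A3"), Some("B3"), None),
  (1, 123, None,    Some(1), None,    None,       None,       Some("A4")),
  (1, 123, None,    Some(2), None,    None,       None,       Some("A5")),
  (1, 123, None,    None,    Some(1), None,       None,       None)
).toDF("ID", "Date", "P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A")

// Step 1: build combined_ID from the single non-null Px_ID per row
val withCombinedId = input.withColumn("combined_ID",
  concat_ws("", col("P1_ID"), col("P2_ID"), col("P3_ID")))

// Step 2: group and aggregate with max to keep the non-null cells
val aggExprs = Seq("P1_ID", "P2_ID", "P3_ID", "P1_A", "P1_B", "P2_A").map(c => max(c).as(c))
val result = withCombinedId.groupBy("ID", "Date", "combined_ID")
  .agg(aggExprs.head, aggExprs.tail: _*)

result.orderBy("combined_ID").show()  // yields the three merged rows shown above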

You could also take a look at this StackOverflow post; it might help with your problem: “”