Apache Spark: collect only the non-null columns of each row into an array


The difficulty is that I am trying to avoid UDFs as far as possible.

I have a dataset "wordsDS" that contains many null values:

+------+------+------+------+
|word_0|word_1|word_2|word_3|
+------+------+------+------+
|     a|     b|  null|     d|
|  null|     f|     m|  null|
|  null|  null|     d|  null|
+------+------+------+------+

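I need to collect all the columns of each row into an array. I don't know the number of columns in advance, so I use the columns() method.
However, this approach produces null elements: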

+--------------------+
|           collected|
+--------------------+
|           [a, b,,d]|
|          [, f, m,,]|
|            [,, d,,]|
+--------------------+

Instead, I need the following result:

+--------------------+
|           collected|
+--------------------+
|           [a, b, d]|
|              [f, m]|
|                 [d]|
+--------------------+
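So, basically, I need to collect all the columns of each row under the following requirements:

The resulting array contains no null elements.

The number of columns is not known in advance.

I have also considered filtering the nulls out of the dataset's "collected" column afterwards, but I could not think of any way to do that other than a UDF. I am trying to avoid UDFs so as not to hurt performance. If anyone can suggest a way to filter the nulls out of the "collected" column with as little overhead as possible, that would be very helpful.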
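For reference, here is a minimal sketch of the setup described above. The exact call used in the question is not shown, so this is only an assumption about what it looked like: it builds the sample rows with Option[String] (so that None becomes a real SQL NULL) and packs every column into one array via the columns method. The names CollectWithNulls, wordsDF and collectAll are illustrative and do not come from the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col}

object CollectWithNulls {
  type Row4 = (Option[String], Option[String], Option[String], Option[String])

  // Builds the example data; None is stored as a real SQL NULL.
  def wordsDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[Row4](
      (Some("a"), Some("b"), None, Some("d")),
      (None, Some("f"), Some("m"), None),
      (None, None, Some("d"), None)
    ).toDF("word_0", "word_1", "word_2", "word_3")
  }

  // Packs every column into a single array column.
  // Null cells are kept as null elements, which is the problem being asked about.
  def collectAll(df: DataFrame): DataFrame =
    df.select(array(df.columns.map(col): _*).as("collected"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-with-nulls")
      .getOrCreate()
    collectAll(wordsDF(spark)).show()
    spark.stop()
  }
}
```

Because the column list is read from df.columns at run time, the same code works for any number of columns, as long as they share a common type.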
You can use array("*") to put all the elements into one array, and then use array_except (requires Spark 2.4+) to filter out the nulls:

    import org.apache.spark.sql.functions.{array, array_except, lit}

    df
      .select(array_except(array("*"), array(lit(null))).as("collected"))
      .show()
which gives:

+---------+
|collected|
+---------+
|[a, b, d]|
|   [f, m]|
|      [d]|
+---------+
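One caveat worth knowing: array_except is a set operation, so besides removing nulls it also drops duplicate values within a row. If duplicates have to survive, one possible alternative that still avoids a UDF is the SQL higher-order function filter, which can be reached through expr since Spark 2.4. The sketch below assumes a local SparkSession; CollectNonNull, collectNonNull and the intermediate column name all_words are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, expr}

object CollectNonNull {
  // Packs all columns into one array, then removes null elements with the
  // built-in SQL lambda function `filter`. Duplicates are preserved.
  def collectNonNull(df: DataFrame): DataFrame =
    df.select(array(df.columns.map(col): _*).as("all_words"))
      .select(expr("filter(all_words, x -> x IS NOT NULL)").as("collected"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only for this demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-non-null")
      .getOrCreate()
    import spark.implicits._

    // The first row repeats "a" on purpose to show that duplicates are kept.
    val df = Seq[(Option[String], Option[String], Option[String], Option[String])](
      (Some("a"), Some("a"), None, Some("d")),
      (None, Some("f"), Some("m"), None)
    ).toDF("word_0", "word_1", "word_2", "word_3")

    collectNonNull(df).show()
    spark.stop()
  }
}
```

On Spark 3.0 and later the same step can also be written with the Scala function org.apache.spark.sql.functions.filter and a Column lambda such as x => x.isNotNull, which avoids the SQL string.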

scala> var df = Seq(("a", "b", "null", "d"), ("null", "f", "m", "null"), ("null", "null", "d", "null")).toDF("word_0", "word_1", "word_2", "word_3")
scala> def arrayNullFilter = udf((arr: Seq[String]) => arr.filter(x => x != "null"))
scala> df.select(array('*).as('all)).withColumn("test", arrayNullFilter(col("all"))).show
+--------------------+---------+
|                 all|     test|
+--------------------+---------+
|     [a, b, null, d]|[a, b, d]|
|  [null, f, m, null]|   [f, m]|
|[null, null, d, n...|      [d]|
+--------------------+---------+
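Note that the session above stores the literal string "null" in the cells and compares against that string, so it would not remove real SQL NULL values. If a UDF is acceptable after all, a variant for real nulls might look like the sketch below. It assumes a local SparkSession and string columns; UdfNullFilter, dropNulls and collectNonNull are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, udf}

object UdfNullFilter {
  // UDF that removes real null elements (not the string "null") from an array.
  val dropNulls = udf((arr: Seq[String]) => arr.filter(_ != null))

  // Packs all columns into one array and passes it through the UDF.
  def collectNonNull(df: DataFrame): DataFrame =
    df.select(dropNulls(array(df.columns.map(col): _*)).as("collected"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only for this demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-null-filter")
      .getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[String], Option[String], Option[String], Option[String])](
      (Some("a"), Some("b"), None, Some("d")),
      (None, Some("f"), Some("m"), None),
      (None, None, Some("d"), None)
    ).toDF("word_0", "word_1", "word_2", "word_3")

    collectNonNull(df).show()
    spark.stop()
  }
}
```

This keeps duplicates, but the optimizer cannot see inside a UDF, which is the overhead the question wanted to avoid.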