Scala: How to count the number of elements in an array field of a streaming Dataset (except one)?

Tags: scala, apache-spark, spark-structured-streaming

I am using Spark 2.1.0.cloudera1.

I have an array column in a streaming DataFrame, and the data in the array looks like this:

["Windows","Ubuntu","Ubuntu","Mac","Mac","Windows","Windows"]
I need the size of this array excluding the element "Windows", i.e. 4.

Here is the approach I took:

-- Count the non-Windows OS entries per cluster, then join the count back on.
WITH os_count AS (
    SELECT
        cluster_id,
        count(e) AS cnt
    FROM systems
    LATERAL VIEW EXPLODE(all_os) exploded AS e
    WHERE e <> 'Windows'
    GROUP BY cluster_id
)
SELECT
    a.cluster_id,
    a.memory,
    a.storage,
    c.cnt
FROM systems a
JOIN os_count c
    ON (a.cluster_id = c.cluster_id)

This returns 7, but I want the "Windows" elements filtered out, which should give 4, and I am not sure how to do that without performing a join.

I got this working by writing a UDF in Spark (Scala); here is the logic:

import org.apache.spark.sql.functions._

// Count the elements of the array that are not "Windows".
val osCountFunction: Seq[String] => Int = _.count(_ != "Windows")
val osCountUDF = udf(osCountFunction)
Please let me know if there is a better way.
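As an aside, the same function can also be registered by name so it is callable from Spark SQL, which would let the original query drop the explode and join entirely. A minimal sketch, assuming the osCountFunction defined above (the name "os_count" here is arbitrary):

// Register the Scala function for use in SQL statements.
spark.udf.register("os_count", osCountFunction)

// The count is now computed per row, with no explode or join:
spark.sql("SELECT cluster_id, memory, storage, os_count(all_os) AS cnt FROM systems")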

Edit 1

Usage of the UDF:

import spark.implicits._   // needed for $"value" and .as[String]

val inputStream = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topics)
      .load()
      .selectExpr("CAST(value AS STRING)")                  // Kafka value bytes -> String
      .as[String]
      .select(from_json($"value", systemSchema).as("data")) // parse the JSON payload
      .withColumn("os_count_with_udf", osCountUDF(col("data.all_os")))

inputStream.createOrReplaceTempView("data_view")
spark.sql("SELECT os_count_with_udf FROM data_view")
      .writeStream
      .format("console")
      .option("truncate", "false")
      .start()

Note: data.all_os is of type Array[String].
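For reference, systemSchema is not shown in the snippet above; a minimal sketch of what it could look like, assuming the fields used in the query (the field types here are guesses):

import org.apache.spark.sql.types._

// Hypothetical schema for the JSON payload; adjust the field types to your data.
val systemSchema = StructType(Seq(
  StructField("cluster_id", StringType),
  StructField("memory", LongType),
  StructField("storage", LongType),
  StructField("all_os", ArrayType(StringType))
))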

You'd better upgrade to 2.4.0 :) :)
I wish I could; my client's cluster is stuck on 2.1.0, and I am desperately waiting for the update.
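For anyone on Spark 2.4.0 or later, the UDF is unnecessary: the built-in array_remove function (added in 2.4.0) combined with size does the same thing. A minimal sketch against the same column:

import org.apache.spark.sql.functions._

// Drop every "Windows" entry from the array, then count what is left.
val withCount = inputStream.withColumn(
  "os_count",
  size(array_remove(col("data.all_os"), "Windows")))

The equivalent expression in SQL is size(array_remove(all_os, 'Windows')).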