Scala: How to count the number of elements in an array field of a streaming Dataset (except one)?

Tags: scala, apache-spark, spark-structured-streaming

I am using Spark 2.1.0.cloudera1.

I have an array column in a streaming DataFrame, and the data in the array looks like this:

["Windows","Ubuntu","Ubuntu","Mac","Mac","Windows","Windows"]
I need the size of this array excluding the element "Windows", i.e. 4.

Here is the approach I took:

-- Count the non-Windows OS entries per cluster, then join the count back on.
WITH os_count AS (
    SELECT
        cluster_id,
        count(e) AS cnt
    FROM systems
    LATERAL VIEW EXPLODE(all_os) exploded AS e
    WHERE e <> 'Windows'
    GROUP BY cluster_id
)
SELECT
    a.cluster_id,
    a.memory,
    a.storage,
    c.cnt
FROM systems a
JOIN os_count c
    ON (a.cluster_id = c.cluster_id)

This returns 7, but I want the "Windows" elements filtered out, which should give 4, and I am not sure how to do that without performing a join.

I got this working by writing a UDF in Spark (Scala); here is the logic:

import org.apache.spark.sql.functions._

// Count the elements of the array that are not "Windows".
val osCountFunction: Seq[String] => Int = _.count(_ != "Windows")
val osCountUDF = udf(osCountFunction)
Please let me know if there is a better way.
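As an aside, the same function can also be registered by name so it is callable from Spark SQL, which would let the original query drop the explode and join entirely. A minimal sketch, assuming the osCountFunction defined above (the name "os_count" here is arbitrary):

// Register the Scala function for use in SQL statements.
spark.udf.register("os_count", osCountFunction)

// The count is now computed per row, with no explode or join:
spark.sql("SELECT cluster_id, memory, storage, os_count(all_os) AS cnt FROM systems")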

Edit 1

Usage of the UDF:

import spark.implicits._   // needed for $"value" and .as[String]

val inputStream = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topics)
      .load()
      .selectExpr("CAST(value AS STRING)")                  // Kafka value bytes -> String
      .as[String]
      .select(from_json($"value", systemSchema).as("data")) // parse the JSON payload
      .withColumn("os_count_with_udf", osCountUDF(col("data.all_os")))

inputStream.createOrReplaceTempView("data_view")
spark.sql("SELECT os_count_with_udf FROM data_view")
      .writeStream
      .format("console")
      .option("truncate", "false")
      .start()

Note: data.all_os is of type Array[String].
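For reference, systemSchema is not shown in the snippet above; a minimal sketch of what it could look like, assuming the fields used in the query (the field types here are guesses):

import org.apache.spark.sql.types._

// Hypothetical schema for the JSON payload; adjust the field types to your data.
val systemSchema = StructType(Seq(
  StructField("cluster_id", StringType),
  StructField("memory", LongType),
  StructField("storage", LongType),
  StructField("all_os", ArrayType(StringType))
))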

You'd better upgrade to 2.4.0 :) :)
I wish I could; my client's cluster is stuck on 2.1.0, and I am desperately waiting for the update.
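For anyone on Spark 2.4.0 or later, the UDF is unnecessary: the built-in array_remove function (added in 2.4.0) combined with size does the same thing. A minimal sketch against the same column:

import org.apache.spark.sql.functions._

// Drop every "Windows" entry from the array, then count what is left.
val withCount = inputStream.withColumn(
  "os_count",
  size(array_remove(col("data.all_os"), "Windows")))

The equivalent expression in SQL is size(array_remove(all_os, 'Windows')).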