Scala: How do I count the elements of an array field in a streaming Dataset, excluding one value?
I am using Spark 2.1.0.cloudera1. I have an array column in a streaming DataFrame, and the data in the array looks like this:
["Windows","Ubuntu","Ubuntu","Mac","Mac","Windows","Windows"]
I need the size of this array excluding the "Windows" elements, i.e. 4.
Below is the approach I took:
WITH os_count AS(
SELECT
cluster_id,
count(e) AS cnt
FROM systems
LATERAL VIEW EXPLODE(all_os) exploded as e
WHERE e <> 'Windows'
GROUP BY cluster_id)
SELECT
a.cluster_id,
a.memory,
a.storage,
c.cnt
FROM
systems a
JOIN
os_count c
ON(a.cluster_id = c.cluster_id)
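This returns 7, but I want to filter out the "Windows" elements, which should give 4. I am not sure how to proceed without performing a join.
I got it working by writing a UDF in Spark (Scala); here is the logic: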
import org.apache.spark.sql.functions._
val osCountFunction: Seq[String] => Int = _.par.filter(_!="Windows").size
val osCountUDF = udf(osCountFunction)
Please let me know if there is a better way.
Edit 1
Usage of the UDF:
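The counting logic inside the UDF can be checked on its own, outside Spark. The following is a minimal sketch, not the original poster's code: the object `OsCount` and the helper `countExcluding` are hypothetical names introduced here. It replaces `.par.filter(...).size` with a plain `count`, on the assumption that the arrays are small enough that parallel collections add overhead rather than speed, and it treats a null array as empty, which the original function does not do.

```scala
object OsCount {
  // Hypothetical helper (not from the original post): counts the entries
  // that differ from `excluded`. A null array is treated as empty, which is
  // an added assumption about how missing data should be handled.
  def countExcluding(allOs: Seq[String], excluded: String): Int =
    Option(allOs).map(_.count(_ != excluded)).getOrElse(0)

  def main(args: Array[String]): Unit = {
    val allOs = Seq("Windows", "Ubuntu", "Ubuntu", "Mac", "Mac", "Windows", "Windows")
    println(countExcluding(allOs, "Windows")) // 4
    println(countExcluding(null, "Windows"))  // 0
  }
}
```

If this shape is useful, the helper could be wrapped the same way as above, for example `udf((xs: Seq[String]) => OsCount.countExcluding(xs, "Windows"))`, using `udf` from `org.apache.spark.sql.functions`.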
val inputStream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topics)
.load()
.selectExpr("CAST(value AS STRING)")
.as[String]
.select(from_json($"value",systemSchema).as("data"))
.withColumn("os_count_with_udf", osCountUDF(col("data.all_os")))
inputStream.createOrReplaceTempView("data_view")
spark.sql("SELECT os_count_with_udf from data_view")
.writeStream
.format("console")
.option("truncate","false")
.start()
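Note: data.all_os is of type array[String].
You would be better off upgrading to 2.4.0, because :)
:) I wish I could. My client's cluster is stuck on 2.1.0 and I am desperately waiting for an update.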