Apache Spark: how to check whether n consecutive events in a Kafka stream are above or below a threshold limit

Tags: apache-spark, pyspark, spark-streaming

I am new to PySpark. I have written a PySpark program that reads a Kafka stream using window operations. Every second I publish messages like the following to Kafka, with different sources, temperatures, and timestamps:

{"temperature":34,"time":"2019-04-17 12:53:02","source":"1010101"}
{"temperature":29,"time":"2019-04-17 12:53:03","source":"1010101"}
{"temperature":28,"time":"2019-04-17 12:53:04","source":"1010101"}
{"temperature":34,"time":"2019-04-17 12:53:05","source":"1010101"}
{"temperature":45,"time":"2019-04-17 12:53:06","source":"1010101"}
{"temperature":34,"time":"2019-04-17 12:53:07","source":"1010102"}
{"temperature":29,"time":"2019-04-17 12:53:08","source":"1010102"}
{"temperature":28,"time":"2019-04-17 12:53:09","source":"1010102"}
{"temperature":34,"time":"2019-04-17 12:53:10","source":"1010102"}
{"temperature":45,"time":"2019-04-17 12:53:11","source":"1010102"}
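For context, here is a minimal sketch of how messages like the samples above might be produced; it assumes the kafka-python package and a broker on localhost:9092, and the topic name "test" matches the subscription in the program below (all of these are assumptions, not part of the original setup).

# Hypothetical producer sketch (assumes kafka-python and a local broker);
# it sends one JSON reading per second, like the samples above.
import json
import time
from datetime import datetime

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

while True:
    reading = {
        "temperature": 34,  # replace with a real sensor value
        "time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "source": "1010101",
    }
    producer.send("test", reading)
    time.sleep(1)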
How can I check whether n consecutive temperature readings from a single source exceed the threshold limit (40), and then publish an alert to Kafka? Also, please let me know whether the program below reads the Kafka stream efficiently or whether it needs any changes.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType
from pyspark.sql.functions import avg, window, from_json, from_unixtime, unix_timestamp
import uuid

schema = StructType([
    StructField("source", StringType(), True),
    StructField("temperature", FloatType(), True),
    StructField("time", StringType(), True)
])

spark = SparkSession \
    .builder.master("local[8]") \
    .appName("test-app") \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 5)

# Read the raw Kafka stream from topic "test" and cast the binary value to a string
df1 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load() \
    .selectExpr("CAST(value AS STRING)")

# Parse the JSON payload into typed columns using the schema defined above
df2 = df1.select(from_json("value", schema).alias(
    "sensors")).select("sensors.*")

# Normalize the time string (from_unixtime still returns a string column)
df3 = df2.select(df2.source, df2.temperature, from_unixtime(
    unix_timestamp(df2.time, 'yyyy-MM-dd HH:mm:ss')).alias('time'))
# Average temperature per source over a 2-minute window sliding every minute
df4 = df3.groupBy(window(df3.time, "2 minutes", "1 minutes"),
                  df3.source).agg(avg("temperature"))

# Write the windowed averages to the console; note that the random UUID means
# a fresh checkpoint directory is used on every run
query1 = df4.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("checkpointLocation", "/tmp/temporary-" + str(uuid.uuid4())) \
    .start()

query1.awaitTermination()
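Regarding the n-consecutive-readings check, here is one possible direction, not an authoritative answer: Structured Streaming does not support lag/lead window functions on streaming DataFrames, so given that each source emits roughly one reading per second, a workaround is a sliding event-time window of N seconds in which every reading exceeds the threshold (count == N and min > 40). The sketch below builds on df2 from the program above; the value N = 5, the "alerts" topic name, and the checkpoint path are placeholder assumptions. True consecutive-event detection would otherwise require arbitrary stateful processing (flatMapGroupsWithState in Scala/Java, or applyInPandasWithState in newer PySpark releases).

from pyspark.sql.functions import col, count, min as min_, to_timestamp, to_json, struct, window

N = 5              # number of consecutive readings to check (placeholder)
THRESHOLD = 40.0   # temperature threshold from the question

# Use a proper timestamp column so a watermark can be applied.
events = df2.select(
    col("source"),
    col("temperature"),
    to_timestamp(col("time"), "yyyy-MM-dd HH:mm:ss").alias("event_time"))

# A sliding window of N seconds that slides every second; with one reading per
# second per source, count == N together with min(temperature) > THRESHOLD
# approximates "N consecutive readings above the threshold".
alerts = events \
    .withWatermark("event_time", "30 seconds") \
    .groupBy(window(col("event_time"), "{} seconds".format(N), "1 second"),
             col("source")) \
    .agg(count("*").alias("readings"), min_("temperature").alias("min_temp")) \
    .where((col("readings") >= N) & (col("min_temp") > THRESHOLD)) \
    .select(col("source").alias("key"),
            to_json(struct("source", "window", "min_temp")).alias("value"))

# Publish the alerts to another Kafka topic ("alerts" is a placeholder name).
query2 = alerts.writeStream \
    .outputMode("update") \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "alerts") \
    .option("checkpointLocation", "/tmp/alert-checkpoint") \
    .start()

# If this runs in the same application as query1, block on all active queries
# instead of a single one.
spark.streams.awaitAnyTermination()

Note that in update mode a window can be emitted more than once while it is still filling; switching to append mode emits each window only once, after the watermark passes the end of the window, at the cost of extra latency.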