PySpark: cumulative count with Spark Structured Streaming
I want to get the cumulative count of values in a dataframe column over the last 1 hour, using a moving window. I can get the expected output with PySpark's non-streaming window functionality using rangeBetween, but I want real-time data processing, so I am trying Spark Structured Streaming such that whenever a new record/transaction arrives in the system, I get the desired output.

The data looks like this:
time,col
2019-04-27 01:00:00,A
2019-04-27 00:01:00,A
2019-04-27 00:05:00,B
2019-04-27 01:01:00,A
2019-04-27 00:08:00,B
2019-04-27 00:03:00,A
2019-04-27 03:03:00,A
Using PySpark non-streaming, I get the expected output.
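(The non-streaming snippet referenced here didn't survive on this page. A minimal sketch of what such a rangeBetween moving-hour count can look like, assuming the sample schema above; the 3600-second lookback and the read path are assumptions, not the asker's original code:)

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Read the sample data shown above; the schema matches the sample rows.
df = spark.read.csv("/datalocation", schema="time timestamp, col string")

# Count rows with the same `col` value in the hour preceding each row:
# order by the epoch seconds of `time` and look back 3600 seconds.
w = (Window.partitionBy("col")
           .orderBy(F.col("time").cast("long"))
           .rangeBetween(-3600, Window.currentRow))

df.withColumn("count", F.count("*").over(w)).orderBy("time").show(truncate=False)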
With Structured Streaming, this is what I am trying. How can I replicate the same using Spark Structured Streaming?

You can group by window and slide duration and count. Here is the word count example from Structured Streaming:
val lines = spark.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .option("includeTimestamp", true)
  .load()

// Split the lines into words, retaining timestamps
val words = lines.as[(String, Timestamp)].flatMap(line =>
  line._1.split(" ").map(word => (word, line._2))
).toDF("word", "timestamp")

val windowDuration = "10 seconds"
val slideDuration = "5 seconds"

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", windowDuration, slideDuration), $"word"
).count().orderBy("window")

// Start running the query that prints the windowed word counts to the console
val query = windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()
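(Since the question is PySpark, here is a rough PySpark translation of the same sliding-window word count; host and port are placeholders:)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

lines = (spark.readStream
    .format("socket")
    .option("host", "localhost")   # placeholder host
    .option("port", 9999)          # placeholder port
    .option("includeTimestamp", True)
    .load())

# Split the lines into words, retaining timestamps
words = lines.select(
    F.explode(F.split(lines.value, " ")).alias("word"),
    lines.timestamp,
)

# 10-second windows sliding every 5 seconds, counted per word
windowedCounts = (words
    .groupBy(F.window(words.timestamp, "10 seconds", "5 seconds"), words.word)
    .count()
    .orderBy("window"))

query = (windowedCounts.writeStream
    .outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .start())

query.awaitTermination()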
Hi, could you get the cumulative count with the sample dataset I provided, along with the code?
userSchema = StructType([
    StructField("time", TimestampType()),
    StructField("col", StringType())
])

# Stream the sample CSV files from the data directory
lines2 = spark \
    .readStream \
    .format("csv") \
    .schema(userSchema) \
    .csv("/datalocation")

# Tumbling one-hour windows, counted per value of `col`
windowedCounts = lines2.groupBy(
    window(lines2.time, "1 hour"),
    lines2.col
).count()

# Write the running counts to an in-memory table named "test2"
windowedCounts.writeStream \
    .format("memory") \
    .outputMode("complete") \
    .queryName("test2") \
    .option("truncate", "false") \
    .start()

spark.table("test2").show(truncate=False)
Streaming output:
+------------------------------------------+---+-----+
|window |col|count|
+------------------------------------------+---+-----+
|[2019-04-27 03:00:00, 2019-04-27 04:00:00]|A |1 |
|[2019-04-27 00:00:00, 2019-04-27 01:00:00]|A |2 |
|[2019-04-27 01:00:00, 2019-04-27 02:00:00]|A |2 |
|[2019-04-27 00:00:00, 2019-04-27 01:00:00]|B |2 |
+------------------------------------------+---+-----+
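(If the goal is a moving last-hour count rather than tumbling hourly buckets, one adaptation of the answer's group-by-window-and-slide suggestion is to give the window a slide duration. A sketch, assuming the `lines2` stream above; the 5-minute slide and the query name "test3" are arbitrary choices, and smaller slides approximate a per-record moving window more closely:)

from pyspark.sql.functions import window

# One-hour windows that advance every 5 minutes (slide chosen arbitrarily)
slidingCounts = lines2.groupBy(
    window(lines2.time, "1 hour", "5 minutes"),
    lines2.col
).count()

slidingCounts.writeStream \
    .format("memory") \
    .outputMode("complete") \
    .queryName("test3") \
    .start()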