Apache Spark: Spark Streaming aggregation and filter in the same window


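I have a fairly simple task: events are arriving, and within the same window I want to keep, per key, only those events whose value is above the group's average. I think this is the relevant part of the code: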

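import org.apache.spark.sql.functions.{avg, window}
import org.apache.spark.sql.streaming.Trigger

// Average fuel efficiency per 30-second window and weather condition
val avgfuel = events
    .groupBy(window($"enqueuedTime", "30 seconds"), $"weatherCondition")
    .agg(avg($"fuelEfficiencyPercentage") as "avg_fuel")

// Join the events with the averages on weatherCondition, keep the above-average ones
val joined = events.join(avgfuel, Seq("weatherCondition"))
    .filter($"fuelEfficiencyPercentage" > $"avg_fuel")

val streamingQuery1 = joined.writeStream
    .outputMode("append")
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .option("checkpointLocation", checkpointLocation)
    .format("json").option("path", containerOutputLocation).start()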

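events is a DStream. The problem is that I get empty files in the output location. I am using Databricks 3.5 (Spark 2.2.1) with Scala 2.11.

What am I doing wrong?

Thanks

Edit: more complete code: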

val inputStream = spark.readStream
  .format("eventhubs") // working with azure event hubs
  .options(eventhubParameters)
  .load()

val schema = (new StructType)    
      .add("id", StringType)
      .add("latitude", StringType)
      .add("longitude", StringType)
      .add("tirePressure", FloatType)
      .add("fuelEfficiencyPercentage", FloatType)
      .add("weatherCondition", StringType)

val df1 = inputStream.select($"body".cast("string").as("value")
                             , from_unixtime($"enqueuedTime").cast(TimestampType).as("enqueuedTime")
                             ).withWatermark("enqueuedTime", "1 minutes")

val df2 = df1.select(from_json(($"value"), schema).as("body")
                     , $"enqueuedTime")

val df3 = df2.select(
  $"enqueuedTime"
  , $"body.id".cast("integer")
  , $"body.latitude".cast("float")
  , $"body.longitude".cast("float")
  , $"body.tirePressure"
  , $"body.fuelEfficiencyPercentage"
  , $"body.weatherCondition"
)

val avgfuel = df3
  .groupBy(window($"enqueuedTime", "10 seconds"), $"weatherCondition" )    
  .agg(avg($"fuelEfficiencyPercentage") as "fuel_avg", stddev($"fuelEfficiencyPercentage") as "fuel_stddev")
  .select($"weatherCondition", $"fuel_avg")

val broadcasted = sc.broadcast(avgfuel)

val joined = df3.join(broadcasted.value, Seq("weatherCondition"))
                .filter($"fuelEfficiencyPercentage" > $"fuel_avg")

val streamingQuery1 = joined.writeStream.
      outputMode("append").
      trigger(Trigger.ProcessingTime("10 seconds")).
      option("checkpointLocation", checkpointLocation).
      format("json").option("path", outputLocation).start()

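This executes without errors and starts writing results after a while. It may be because of the broadcast of the aggregation result, but I am not sure.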

A little investigation ;)

events cannot be a DStream, because you are able to use Dataset operations on it; it must be a Dataset.

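Stream-stream joins are not allowed in Spark 2.2. I tried running your code with events as a rate source, and I got:

org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;;
Join Inner, (value#1L = eventValue#41L)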


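The result is quite unexpected: you probably used read instead of readStream, so you created a static Dataset rather than a streaming one. Change it to readStream and it will work, after upgrading to 2.3, of course.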
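
One way to check which kind of Dataset you actually have is the isStreaming flag. The following is only a minimal sketch under stated assumptions: it runs against a local SparkSession (in a Databricks notebook the existing spark session would be used instead), spark.range stands in for any batch read, and the built-in rate source stands in for Event Hubs. The names staticDf and streamDf are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local session for illustration; a notebook already provides `spark`.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("is-streaming-check")
  .getOrCreate()

// Static Dataset: built with a batch API (spark.read.* behaves the same way).
val staticDf = spark.range(10).toDF("value")

// Streaming Dataset: built with readStream; "rate" is a built-in test source.
val streamDf = spark.readStream.format("rate").load()

println(s"static isStreaming = ${staticDf.isStreaming}")
println(s"stream isStreaming = ${streamDf.isStreaming}")
```

If events reports isStreaming = false, the join in the question is an ordinary batch join, and the stream-stream join restriction does not apply to it.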
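Apart from the remarks above, the code is correct and should run correctly on Spark 2.3. Note that you must also change the output mode to complete instead of append, because you are doing an aggregation.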
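
As a rough illustration of the output-mode remark, here is a sketch of a windowed aggregation written in complete mode. It covers the aggregation only, not the join. Everything specific in it is an assumption: the rate source replaces the real event stream, key is a made-up grouping column, and the memory sink is used because file sinks such as json accept only append mode.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, window}

// Assumption: local session for illustration; a notebook already provides `spark`.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("complete-mode-sketch")
  .getOrCreate()

// The "rate" source emits (timestamp, value) rows; `key` is a hypothetical grouping column.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()
  .withColumn("key", col("value") % 3)

// Average per 10-second window and key, mirroring the shape of the question's aggregation.
val averages = events
  .groupBy(window(col("timestamp"), "10 seconds"), col("key"))
  .agg(avg(col("value")).as("avg_value"))

// Complete mode re-emits the whole result table on every trigger.
val query = averages.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("averages")
  .start()

query.awaitTermination(15000) // let a few micro-batches run, then inspect the table
val result = spark.table("averages")
result.show(truncate = false)
query.stop()
```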
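Does it really work? AFAIK no kind of join between two streaming Datasets is supported yet. Separately, analyze the logic: if a window contains only one event, the output should be empty.

If you use Structured Streaming, events cannot be a DStream ;)

@user8371915 Stream-stream joins are in 2.3.

@T.Gawęda But it isn't 2.3.

@user8371915 Yes, so it will probably fail; maybe it did not throw an exception but silently generated zero rows.

@user8371915 Spark is throwing an exception, so the only explanation for this question is that the wrong method was used to read the data :)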
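
The point about windows that hold a single event can be checked without Spark. The sketch below uses plain Scala collections; the Event case class, the aboveGroupAverage helper, and the sample values are all made up for illustration and are not part of the question's code. It keeps the events that are strictly above the average of their (window, key) group.

```scala
// Hypothetical event type; field names loosely mirror the question's columns.
final case class Event(enqueuedSec: Long, weatherCondition: String, fuelEfficiencyPercentage: Double)

// Keep events whose value is strictly above the average of their (window, key) group.
def aboveGroupAverage(events: Seq[Event], windowSec: Long): Seq[Event] =
  events
    .groupBy(e => (e.enqueuedSec / windowSec, e.weatherCondition))
    .values
    .flatMap { group =>
      val groupAvg = group.map(_.fuelEfficiencyPercentage).sum / group.size
      group.filter(_.fuelEfficiencyPercentage > groupAvg)
    }
    .toSeq

val sample = Seq(
  Event(1, "rain", 40.0),
  Event(4, "rain", 60.0),  // above the rain average (50.0) in the first window
  Event(7, "sun", 80.0),   // the only "sun" event in the first window
  Event(12, "rain", 50.0)  // the only event in the second window
)

val kept = aboveGroupAverage(sample, windowSec = 10)
println(kept)
```

A group with one event has an average equal to that event's value, so the strict comparison drops it. With sparse input, an empty result is therefore consistent with the filter itself.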