Apache Flink: getting the previous window value when processing late events

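I'm looking for a way to set up windows that allow lateness and let me calculate values based on the values previously computed for a session.

My session values are, overall, unique identifiers and should not conflict, but technically a session could come in at any time. For most sessions, the majority of events are processed over 5 minutes, and allowing a lateness of 1 day should cover any late events.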

  stream
    .keyBy { jsonEvent => jsonEvent.findValue("session").toString }
    .window(ProcessingTimeSessionWindows.withGap(Time.minutes(5)))
    .allowedLateness(Time.days(1))
    .process { new SessionProcessor }
    .addSink { new HttpSink }

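For each session, I find the maximum value of a field and check that a couple of events did not occur (if they did occur, the max value field is set to zero). To do this, I decided to create a ProcessWindowFunction: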

class SessionProcessor extends ProcessWindowFunction[ObjectNode, (String, String, String, Long), String, TimeWindow] {

   override def process(key: String, context: Context, elements: Iterable[ObjectNode], out: Collector[(String, String, String, Long)]): Unit = {
      //Parse and calculate data
      maxValue = if(badEvent1) 0 else maxValue
      maxValue = if(badEvent2) 0 else maxValue          
      out.collect((string1,string2,string3, maxValue))
   }
}

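This works fine until late events are taken into account. When a late event arrives, maxValue is recalculated and emitted to the HttpSink again. I'm looking for a way to calculate the delta between the previous maxValue and the next maxValue.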

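I'm looking for a way to determine:

Whether the call to the function came from a late event (I don't want to double count session totals)

What the new data is, or whether there is a way to store the previously calculated value

Any help with this would be greatly appreciated.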

EDIT: new code using ValueState

KafkaConsumer.scala

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.connectors.kafka._
import org.apache.flink.streaming.util.serialization.JSONDeserializationSchema
import org.apache.flink.streaming.api.scala._
import com.fasterxml.jackson.databind.node.ObjectNode
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time


object KafkaConsumer {
   def main(args: Array[String]) {
      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
      val properties = getServerProperties
      val consumer = new FlinkKafkaConsumer010[ObjectNode]("test-topic", new JSONDeserializationSchema, properties)
      consumer.setStartFromLatest()
      val stream = env.addSource(consumer)

      stream
        .keyBy { jsonEvent => jsonEvent.findValue("data").findValue("query").findValue("session").toString }
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .allowedLateness(Time.days(1))
        .process {
          new SessionProcessor
        }
        .print
      env.execute("Kafka APN Consumer")
    }
  }

SessionProcessor.scala

import org.apache.flink.util.Collector
import com.fasterxml.jackson.databind.node.ObjectNode
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

class SessionProcessor extends ProcessWindowFunction[ObjectNode, (String, String, String, Long), String, TimeWindow] {

  final val previousValue = new ValueStateDescriptor("previousValue", classOf[Long])

  override def process(key: String, context: Context, elements: Iterable[ObjectNode], out: Collector[(String, String, String, Long)]): Unit = {

    val previousVal: ValueState[Long] = context.windowState.getState(previousValue)
    val pVal: Long = previousVal.value match {
      case i: Long => i
    }
    var session = ""
    var user = ""
    var department = ""
    var lVal: Long = 0

    elements.foreach( value => {
      var jVal: String = "0"
      if (value.findValue("data").findValue("query").has("value")) {
        jVal = value.findValue("data").findValue("query").findValue("value").toString replaceAll("\"", "")
      }
      session = value.findValue("data").findValue("query").findValue("session").toString replaceAll("\"", "")
      user = value.findValue("data").findValue("query").findValue("user").toString replaceAll("\"", "")
      department = value.findValue("data").findValue("query").findValue("department").toString replaceAll("\"", "")
      lVal = if (jVal.toLong > lVal) jVal.toLong else lVal
    })

    val increaseTime = lVal - pVal
    previousVal.update(increaseTime)
    out.collect((session, user, department, increaseTime))
  }
}

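Here is an example that does something similar. Hopefully it is reasonably self-explanatory, and it should be fairly easy to adapt to your needs.

The basic idea is that you can use context.windowState(), which is per-window state made available through the Context passed to a ProcessWindowFunction. This window state is really only useful for windows that fire multiple times, because each new window instance gets a freshly initialized (empty) window state store. For state that is shared across all windows (but is still keyed), use context.globalState().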
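To make the scoping difference concrete, here is a small plain-Java model. It is only a sketch and does not use the Flink API: one HashMap keyed by (key, window start) stands in for context.windowState(), and another keyed by the key alone stands in for context.globalState(). The class and method names (StateScopeModel, fireWithWindowState, fireWithGlobalState) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of state scoping; it does not use Flink.
// StateScopeModel and its method names are hypothetical.
class StateScopeModel {
    // Stand-in for context.windowState(): one slot per (key, window).
    private final Map<String, Long> perWindow = new HashMap<>();
    // Stand-in for context.globalState(): one slot per key, shared by all windows.
    private final Map<String, Long> perKey = new HashMap<>();

    /** Delta against the value stored by the previous firing of the same window. */
    long fireWithWindowState(String key, long windowStart, long max) {
        String slot = key + "@" + windowStart;
        long previous = perWindow.getOrDefault(slot, 0L);
        perWindow.put(slot, max);
        return max - previous;
    }

    /** Delta against the value stored by the previous firing of any window of this key. */
    long fireWithGlobalState(String key, long max) {
        long previous = perKey.getOrDefault(key, 0L);
        perKey.put(key, max);
        return max - previous;
    }

    public static void main(String[] args) {
        StateScopeModel m = new StateScopeModel();
        // Same key, two different windows: the window-scoped slot starts empty each time.
        System.out.println(m.fireWithWindowState("s1", 0L, 10L));    // 10
        System.out.println(m.fireWithWindowState("s1", 5000L, 15L)); // 15
        // The same window firing again: the previous value is visible.
        System.out.println(m.fireWithWindowState("s1", 5000L, 18L)); // 3
        // Key-scoped state is shared across windows of the key.
        System.out.println(m.fireWithGlobalState("s1", 10L)); // 10
        System.out.println(m.fireWithGlobalState("s1", 15L)); // 5
    }
}
```

In this model, a "previous value" only shows up in the window-scoped slot when the very same window fires again. If every firing belongs to a new window, the previous value always reads as the default.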


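This seems to be what I want, and I understand what it is trying to do, but I'm not seeing my ValueState change at all between calls. Note: I had to switch from ProcessingTimeSessionWindows to TumblingProcessingTimeWindows, because it did not want to allow windowState in merging windows (which is fine). I edited the OP with the new code; pVal is always 0, even though I see the same sessionId multiple times and increaseVal is non-zero.

Are there actually late events? If you want to share your new code, I'll take a look.

I updated the OP with the two files. I'm pretty sure the events are late, because I see printed output more than once per session (I believe that without lateness the windows would be processed once). I also have very short windows (5 seconds), and additional processing is printed more than 5 seconds later. Let me know if there is anything else I can provide. Thanks for all the help so far!

I don't see why it isn't working the way you expect. But don't forget to implement the clear() method on your SessionProcessor. I'm not sure, but it is possible that merging windows would support windowState if you used ReducingState rather than ValueState.
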
private static class DifferentialWindowFunction
  extends ProcessWindowFunction<Long, Tuple2<Long, Long>, String, TimeWindow> {

  private final static ValueStateDescriptor<Long> previousFiringState =
    new ValueStateDescriptor<>("previous-firing", LongSerializer.INSTANCE);

  private final static ReducingStateDescriptor<Long> firingCounterState =
    new ReducingStateDescriptor<>("firing-counter", new Sum(), LongSerializer.INSTANCE);

  @Override
  public void process(
      String key, 
      Context context, 
      Iterable<Long> values, 
      Collector<Tuple2<Long, Long>> out) {

    ValueState<Long> previousFiring = context.windowState().getState(previousFiringState);
    ReducingState<Long> firingCounter = context.windowState().getState(firingCounterState);

    Long output = Iterables.getOnlyElement(values);
    if (firingCounter.get() == null) {
      // first firing
      out.collect(Tuple2.of(0L, output));
    } else {
      // subsequent firing
      out.collect(Tuple2.of(firingCounter.get(), output - previousFiring.value()));    
    } 
    firingCounter.add(1L);
    previousFiring.update(output);
  }

  @Override
  public void clear(Context context) {
    ValueState<Long> previousFiring = context.windowState().getState(previousFiringState);
    ReducingState<Long> firingCounter = context.windowState().getState(firingCounterState);

    previousFiring.clear();
    firingCounter.clear();
  }
}
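
To see what this function emits across repeated firings of one window, here is a plain-Java trace of the same logic. It is a sketch only, with no Flink involved: two nullable fields stand in for the ValueState and the ReducingState, and the names FiringTrace, fire, and clear are hypothetical.

```java
import java.util.Arrays;

// Plain-Java trace of the firing logic above; it does not use Flink.
// FiringTrace and its members are hypothetical names.
class FiringTrace {
    private Long previousFiring; // null until the first firing, like an empty ValueState
    private Long firingCounter;  // null until the first firing, like an empty ReducingState

    /** Mirrors process(): returns {firing index, full value or delta}. */
    long[] fire(long output) {
        long[] emitted = (firingCounter == null)
            ? new long[] {0L, output}                              // first firing: full value
            : new long[] {firingCounter, output - previousFiring}; // later firings: delta
        firingCounter = (firingCounter == null) ? 1L : firingCounter + 1L;
        previousFiring = output;
        return emitted;
    }

    /** Mirrors clear(): drops both pieces of per-window state. */
    void clear() {
        previousFiring = null;
        firingCounter = null;
    }

    public static void main(String[] args) {
        FiringTrace window = new FiringTrace();
        System.out.println(Arrays.toString(window.fire(10L))); // [0, 10]
        System.out.println(Arrays.toString(window.fire(12L))); // [1, 2]
        System.out.println(Arrays.toString(window.fire(12L))); // [2, 0]
        window.clear();
        System.out.println(Arrays.toString(window.fire(7L)));  // [0, 7]
    }
}
```

The first firing emits the full value with index 0, and each later firing of the same window emits only the change since the previous firing. After clear(), the next firing behaves like a first firing again.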