Trigger interval control in Spark Structured Streaming

apache-spark apache-kafka

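For a given scenario, I want to filter a dataset in Structured Streaming by combining continuous and batch triggers.

I know this sounds unrealistic and may not be feasible. Here is what I am trying to achieve.

Let the processing-time interval set in the application be 5 minutes. Let the records have the following schema: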

{
  "type": "record",
  "name": "event",
  "fields": [
    { "name": "Student", "type": "string" },
    { "name": "Subject", "type": "string" }
  ]
}

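My streaming application should write the result to the sink as soon as either of the following two criteria is met:

A student has more than 5 subjects. (This criterion takes priority.)

The processing time given in the trigger has expired.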
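
The two criteria can be modelled outside Spark to make the intended behaviour concrete. The following is a minimal sketch in plain Java, not Spark code: `SubjectBuffer`, `flushAtCount`, and `flushOnTimeout` are hypothetical names introduced only for illustration. Because the question says "more than 5" while the sample output shows a student with five subjects being written first, the count is left as a constructor parameter rather than hard-coded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of the two write criteria; not part of any Spark API. */
final class SubjectBuffer {
    // Assumed threshold: a student is emitted once this many subjects have arrived.
    private final int flushAtCount;
    // Insertion-ordered so that a timeout flush keeps arrival order.
    private final Map<String, List<String>> pending = new LinkedHashMap<>();

    SubjectBuffer(int flushAtCount) {
        this.flushAtCount = flushAtCount;
    }

    /** Criterion 1: returns an output line as soon as a student reaches the threshold. */
    Optional<String> add(String student, String subject) {
        List<String> subjects = pending.computeIfAbsent(student, k -> new ArrayList<>());
        subjects.add(subject);
        if (subjects.size() >= flushAtCount) {
            pending.remove(student);
            return Optional.of(format(student, subjects));
        }
        return Optional.empty();
    }

    /** Criterion 2: when the processing time expires, emit everything still buffered. */
    List<String> flushOnTimeout() {
        List<String> out = new ArrayList<>();
        pending.forEach((student, subjects) -> out.add(format(student, subjects)));
        pending.clear();
        return out;
    }

    private static String format(String student, List<String> subjects) {
        return student + ":{" + String.join(",", subjects) + "}";
    }
}
```

This only models the decision rule. It says nothing about how Spark schedules micro-batches or when a sink actually receives the rows.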

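// Imports assumed for this snippet (the original post omitted them).
// SchemaConverters is assumed to come from Spark's built-in Avro module (Spark 2.4+).
import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.avro.SchemaConverters;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

private static Injection<GenericRecord, byte[]> recordInjection;
private static StructType type;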
public static final String USER_SCHEMA = "{"
        + "\"type\":\"record\","
        + "\"name\":\"alarm\","
        + "\"fields\":["
        + "  { \"name\":\"student\", \"type\":\"string\" },"
        + "  { \"name\":\"subject\", \"type\":\"string\" }"
        + "]}";

private static Schema.Parser parser = new Schema.Parser();

private static Schema schema = parser.parse(USER_SCHEMA);

static {
    recordInjection = GenericAvroCodecs.toBinary(schema);
    type = (StructType) SchemaConverters.toSqlType(schema).dataType();

}
sparkSession.udf().register("deserialize", (byte[] data) -> {
        GenericRecord record = recordInjection.invert(data).get();
        return RowFactory.create(record.get("student").toString(), record.get("subject").toString());
    }, DataTypes.createStructType(type.fields()));


Dataset<Row> ds2 = ds1
        .select("value").as(Encoders.BINARY())
        .selectExpr("deserialize(value) as rows")
        .select("rows.*")
        .selectExpr("student","subject");

StreamingQuery query1 = ds2
        .writeStream()
        .foreachBatch(
            new VoidFunction2<Dataset<Row>, Long>() {
              @Override
              public void call(Dataset<Row> rowDataset, Long aLong) throws Exception {
                rowDataset.select("student,concat(',',subject)").alias("value").groupBy("student");
              }
            }
        ).format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "new_in")
        .option("checkpointLocation", "checkpoint")
        .outputMode("append")
        .trigger(Trigger.ProcessingTime(10000))
        .start();
query1.awaitTermination();

In the Kafka console consumer, I expect output like the following:

Test:{x,y,z,w,v} => this should be the first response
Test1:{x,y} => second
Test2:{x,y} => third, by the end of the processing time

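@JacekLaskowski The requirement applies to each micro-batch, so I don't need to maintain state.

So why not use DataStreamWriter.foreachBatch? It seems like the best fit.

@JacekLaskowski I am unable to group by student name. I have updated the code snippet in the question with the code I am using. Please take a look and let me know what you think.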
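
To illustrate what the per-micro-batch grouping is meant to produce, here is a small sketch using plain Java collections as a stand-in for the body of a `foreachBatch` callback. `MicroBatchGrouper`, `group`, and `minSubjects` are hypothetical names, and the rows are modelled as `{student, subject}` string pairs. In an actual callback the same step would more likely be expressed with Dataset operations such as `groupBy("student")` and `collect_list("subject")`, followed by a batch write to Kafka; this sketch only shows the grouping and ordering logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the grouping done on one micro-batch. */
final class MicroBatchGrouper {

    /** Each row is {student, subject}; minSubjects is an assumed priority threshold. */
    static List<String> group(List<String[]> rows, int minSubjects) {
        // Collect the subjects of every student, keeping first-arrival order.
        Map<String, List<String>> byStudent = new LinkedHashMap<>();
        for (String[] row : rows) {
            byStudent.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        List<Map.Entry<String, List<String>>> entries = new ArrayList<>(byStudent.entrySet());
        // Stable sort: students at or above the threshold come first,
        // everyone else keeps their arrival order.
        entries.sort((a, b) -> Boolean.compare(
                a.getValue().size() < minSubjects,
                b.getValue().size() < minSubjects));

        // Render each group in the student:{s1,s2,...} form used in the question.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : entries) {
            lines.add(e.getKey() + ":{" + String.join(",", e.getValue()) + "}");
        }
        return lines;
    }
}
```

Note that grouping inside a micro-batch only orders the output of that batch. Everything is still written when the batch is processed, so by itself this does not emit a student's record before the trigger interval elapses.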
