Using the same sink for two message streams in Apache Flink


Flink receives two kinds of messages:

  • Control messages -> only roll the file
  • Data messages -> to be stored in S3 using the sink

We have separate source streams for the two kinds of messages, and we attach the same sink to both streams. What we want is to broadcast the control messages so that all sink instances running in parallel receive them.

Here is the code:

package com.ranjit.com.flinkdemo;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.fs.StringWriter;

public class FlinkBroadcast {
    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        DataStream<String> ctrl_message_stream = env.socketTextStream("localhost", 8088);
        DataStream<String> message_stream = env.socketTextStream("localhost", 8087);

        RollingSink<String> sink = new RollingSink<String>("/base/path");
        sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
        sink.setWriter(new StringWriter<String>());
        sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB

        ctrl_message_stream.broadcast().addSink(sink);
        message_stream.addSink(sink);

        env.execute("stream");
    }
}
    
But what I observed is that 4 sink instances get created, and the control messages are broadcast only to 2 of them (the ones created from the control message stream). So my understanding is that both streams would have to go through the same chain of operators to achieve this, which we do not want, because there would then be multiple transformations on the data messages. We have therefore written our own sink, which reads each message and, if it is a control message, only rolls the file (a rough sketch of such a sink follows below).
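
The custom sink itself is not shown in the question. A minimal sketch of the idea, assuming records are Avro GenericRecords and control messages carry a "TYPE" field as in the schemas used below (the class name RollOnControlSink and the part-file naming are illustrative, not from the original post):

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    // Sketch: writes data messages to the current part file and starts a new
    // part file whenever a control message arrives (error handling omitted).
    public class RollOnControlSink extends RichSinkFunction<GenericRecord> {

        private final String basePath;            // directory for the part files
        private transient BufferedWriter writer;  // writer for the current part file
        private transient int partCounter;        // suffix of the next part file

        public RollOnControlSink(String basePath) {
            this.basePath = basePath;
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            partCounter = 0;
            openNewPartFile();
        }

        @Override
        public void invoke(GenericRecord value) throws Exception {
            if (value.getSchema().getField("TYPE") != null) {
                // control message: only roll the file, do not write the record
                openNewPartFile();
            } else {
                writer.write(value.toString());
                writer.newLine();
            }
        }

        @Override
        public void close() throws Exception {
            if (writer != null) {
                writer.close();
            }
        }

        private void openNewPartFile() throws Exception {
            if (writer != null) {
                writer.close();
            }
            int subtask = getRuntimeContext().getIndexOfThisSubtask();
            writer = new BufferedWriter(
                    new FileWriter(basePath + "/part-" + subtask + "-" + partCounter++));
        }
    }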

Sample code (using broadcast() and union()):

    package com.gslab.com.dataSets;
    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericData.Record;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    
    public class FlinkBroadcast {
        public static void main(String[] args) throws Exception {
    
            final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(2);
    
            List<String> controlMessageList = new ArrayList<String>();
            controlMessageList.add("controlMessage1");
            controlMessageList.add("controlMessage2");
    
            List<String> dataMessageList = new ArrayList<String>();
            dataMessageList.add("Person1");
            dataMessageList.add("Person2");
            dataMessageList.add("Person3");
            dataMessageList.add("Person4");
    
            DataStream<String> controlMessageStream  = env.fromCollection(controlMessageList);
            DataStream<String> dataMessageStream  = env.fromCollection(dataMessageList);
    
            DataStream<GenericRecord> controlMessageGenericRecordStream = controlMessageStream.map(new MapFunction<String, GenericRecord>() {
                @Override
                public GenericRecord map(String value) throws Exception {
                     Record gr = new GenericData.Record(new Schema.Parser().parse(new File("src/main/resources/controlMessageSchema.avsc")));
                     gr.put("TYPE", value);
                     return gr;
                }
            });
    
            DataStream<GenericRecord> dataMessageGenericRecordStream = dataMessageStream.map(new MapFunction<String, GenericRecord>() {
                @Override
                public GenericRecord map(String value) throws Exception {
                     Record gr = new GenericData.Record(new Schema.Parser().parse(new File("src/main/resources/dataMessageSchema.avsc")));
                     gr.put("FIRSTNAME", value);
                     gr.put("LASTNAME", value+": lastname");
                     return gr;
                }
            });
    
        // Print the data records before the union
            dataMessageGenericRecordStream.map(new MapFunction<GenericRecord, GenericRecord>() {
                @Override
                public GenericRecord map(GenericRecord value) throws Exception {
                    System.out.println("data before union: "+ value);
                    return value;
                }
            });
    
            controlMessageGenericRecordStream.broadcast().union(dataMessageGenericRecordStream).map(new MapFunction<GenericRecord, GenericRecord>() {
                @Override
                public GenericRecord map(GenericRecord value) throws Exception {
                    System.out.println("data after union: " + value);
                    return value;
                }
            });
            env.execute("stream");
        }
    }
    

As we can see (full output at the end of this post), the LASTNAME values are not correct; each one is replaced by the FIRSTNAME value of its record.

Your code actually terminates both streams in separate copies of the sink you defined. What you want is something like this:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);

DataStream<String> ctrl_message_stream = env.socketTextStream("localhost", 8088);
DataStream<String> message_stream = env.socketTextStream("localhost", 8087);

RollingSink<String> sink = new RollingSink<String>("/base/path");
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
sink.setWriter(new StringWriter<String>());
sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB

ctrl_message_stream.broadcast().union(message_stream).addSink(sink);

env.execute("stream");
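
With union, both streams feed one and the same sink operator, so the broadcast control messages reach every parallel sink subtask. A quick way to see this (a hypothetical check, not part of the answer) is to swap in a printing sink that tags each record with its subtask index:

    ctrl_message_stream.broadcast().union(message_stream)
            .addSink(new RichSinkFunction<String>() {
                @Override
                public void invoke(String value) {
                    // with broadcast(), every subtask prints each control message once
                    System.out.println("subtask "
                            + getRuntimeContext().getIndexOfThisSubtask() + ": " + value);
                }
            });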
    
Comments on this answer:

  • Thanks, but when I print the data in the sink after the union, it is not the exact data, i.e. all field values get replaced by the first field value.
  • Could you clarify, e.g. give example inputs for the two sockets and the output you get?
  • That is strange. Please print the TypeInformation of each data stream. You can use DataStream.getType(), i.e. System.out.println(dataMessageGenericRecordStream.getType()).
  • Printing dataMessageGenericRecordStream.getType() gives: GenericType. Printing controlMessageGenericRecordStream.getType() gives: GenericType.
  • @RanjitShinde Thanks. Could you please explain why you add the broadcast() stage to the first stream?

Output of the sample program from the question:
    05/09/2016 13:02:12 Source: Collection Source(1/1) switched to FINISHED 
    05/09/2016 13:02:12 Source: Collection Source(1/1) switched to FINISHED 
    05/09/2016 13:02:13 Map(1/2) switched to FINISHED 
    05/09/2016 13:02:13 Map(2/2) switched to FINISHED 
    data after union: {"TYPE": "controlMessage1"}
    data before union: {"FIRSTNAME": "Person2", "LASTNAME": "Person2: lastname"}
    data after union: {"TYPE": "controlMessage1"}
    data before union: {"FIRSTNAME": "Person1", "LASTNAME": "Person1: lastname"}
    data after union: {"TYPE": "controlMessage2"}
    data after union: {"TYPE": "controlMessage2"}
    data after union: {"FIRSTNAME": "Person1", "LASTNAME": "Person1"}
    data before union: {"FIRSTNAME": "Person4", "LASTNAME": "Person4: lastname"}
    data before union: {"FIRSTNAME": "Person3", "LASTNAME": "Person3: lastname"}
    data after union: {"FIRSTNAME": "Person2", "LASTNAME": "Person2"}
    data after union: {"FIRSTNAME": "Person3", "LASTNAME": "Person3"}
    05/09/2016 13:02:13 Map -> Map(2/2) switched to FINISHED 
    data after union: {"FIRSTNAME": "Person4", "LASTNAME": "Person4"}
    05/09/2016 13:02:13 Map -> Map(1/2) switched to FINISHED 
    05/09/2016 13:02:13 Map(1/2) switched to FINISHED 
    05/09/2016 13:02:13 Map(2/2) switched to FINISHED 
    05/09/2016 13:02:13 Job execution switched to status FINISHED.