Using the same sink for two message streams in Apache Flink (apache-flink, flink-streaming)
Flink receives two kinds of messages:

control message -> only roll the file
data message -> store in S3 using a sink

We have separate source streams for both messages, and we attach the same sink to both streams. What we want is to broadcast the control message so that all parallel running sink instances receive it. Here is the code:
package com.ranjit.com.flinkdemo;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.fs.StringWriter;

public class FlinkBroadcast {
    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        DataStream<String> ctrl_message_stream = env.socketTextStream("localhost", 8088);
        ctrl_message_stream.broadcast();

        DataStream<String> message_stream = env.socketTextStream("localhost", 8087);

        RollingSink<String> sink = new RollingSink<String>("/base/path");
        sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
        sink.setWriter(new StringWriter<String>());
        sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB

        ctrl_message_stream.broadcast().addSink(sink);
        message_stream.addSink(sink);

        env.execute("stream");
    }
}
But what I observed is that it creates 4 sink instances, and the control messages are broadcast only to 2 of them (the ones created for the control message stream).

So my understanding is that both streams would have to go through the same chain of operators to achieve this, which we don't want, because there will be multiple transformations on the data messages.

We have written our own sink, which reads each message and, if it is a control message, only rolls the file.
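The branching such a sink performs on each record can be sketched as follows (plain Java, no Flink dependency; the class name, the `isControlMessage` helper, and the use of a `TYPE` field to mark control messages are illustrative assumptions mirroring the Avro schemas used below, not the actual implementation):

```java
import java.util.Map;

// Hypothetical sketch of the per-record decision a control-aware sink makes.
// Records are modeled as maps here; in the real sink they are GenericRecords.
public class ControlAwareSink {

    // A record carrying a "TYPE" field is treated as a control message.
    static boolean isControlMessage(Map<String, Object> record) {
        return record.containsKey("TYPE");
    }

    // Stand-in for the sink's per-record callback: control messages only
    // roll the current part file, data messages are appended to it.
    static String handle(Map<String, Object> record) {
        return isControlMessage(record) ? "roll" : "write";
    }
}
```

In the actual sink this branching would live inside the user-defined invoke() method, with the "roll" branch closing the current part file and the "write" branch appending the record to it.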
Sample code:
package com.gslab.com.dataSets;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class FlinkBroadcast {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        List<String> controlMessageList = new ArrayList<String>();
        controlMessageList.add("controlMessage1");
        controlMessageList.add("controlMessage2");

        List<String> dataMessageList = new ArrayList<String>();
        dataMessageList.add("Person1");
        dataMessageList.add("Person2");
        dataMessageList.add("Person3");
        dataMessageList.add("Person4");

        DataStream<String> controlMessageStream = env.fromCollection(controlMessageList);
        DataStream<String> dataMessageStream = env.fromCollection(dataMessageList);

        DataStream<GenericRecord> controlMessageGenericRecordStream = controlMessageStream.map(new MapFunction<String, GenericRecord>() {
            @Override
            public GenericRecord map(String value) throws Exception {
                Record gr = new GenericData.Record(new Schema.Parser().parse(new File("src/main/resources/controlMessageSchema.avsc")));
                gr.put("TYPE", value);
                return gr;
            }
        });

        DataStream<GenericRecord> dataMessageGenericRecordStream = dataMessageStream.map(new MapFunction<String, GenericRecord>() {
            @Override
            public GenericRecord map(String value) throws Exception {
                Record gr = new GenericData.Record(new Schema.Parser().parse(new File("src/main/resources/dataMessageSchema.avsc")));
                gr.put("FIRSTNAME", value);
                gr.put("LASTNAME", value + ": lastname");
                return gr;
            }
        });

        // Displaying generic records
        dataMessageGenericRecordStream.map(new MapFunction<GenericRecord, GenericRecord>() {
            @Override
            public GenericRecord map(GenericRecord value) throws Exception {
                System.out.println("data before union: " + value);
                return value;
            }
        });

        controlMessageGenericRecordStream.broadcast().union(dataMessageGenericRecordStream).map(new MapFunction<GenericRecord, GenericRecord>() {
            @Override
            public GenericRecord map(GenericRecord value) throws Exception {
                System.out.println("data after union: " + value);
                return value;
            }
        });

        env.execute("stream");
    }
}
As we can see, the LASTNAME values are not correct; each is replaced by the record's FIRSTNAME value.

Your code actually terminates both streams with their own respective copies of the sink you defined. What you want is something like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);

DataStream<String> ctrl_message_stream = env.socketTextStream("localhost", 8088);
DataStream<String> message_stream = env.socketTextStream("localhost", 8087);

RollingSink<String> sink = new RollingSink<String>("/base/path");
sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
sink.setWriter(new StringWriter<String>());
sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB

ctrl_message_stream.broadcast().union(message_stream).addSink(sink);

env.execute("stream");
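To see why the original wiring only broadcast to half of the sinks: every addSink() call adds its own sink operator, and each operator is instantiated once per parallel subtask. A toy calculation (plain Java, no Flink; the class and method names are made up for illustration):

```java
// The number of sink instances is the parallelism multiplied by the
// number of separate addSink() calls in the job graph.
public class SinkInstanceCount {
    static int instances(int parallelism, int addSinkCalls) {
        return parallelism * addSinkCalls;
    }
}
```

With the original code there are two addSink() calls at parallelism 2, giving instances(2, 2) == 4, of which only the 2 belonging to the control stream receive the broadcast. After the union there is a single addSink() call, so instances(2, 1) == 2 and every sink instance sees the control messages.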
Thanks, but when I print the data in the sink after the union, it is not the exact data, i.e. all field values are replaced by the value of the first field.

Could you clarify that, e.g. by giving sample input for the two sockets and the sample output you get?

That is strange. Please print the type information for each of the data streams. You can use DataStream.getType(), i.e. System.out.println(dataMessageGenericRecordStream.getType()).

Printing dataMessageGenericRecordStream.getType(): GenericType. Printing controlMessageGenericRecordStream.getType(): GenericType.

@RanjitShinde Thanks. Could you please explain why you added the broadcast() stage in the first stream?
05/09/2016 13:02:12 Source: Collection Source(1/1) switched to FINISHED
05/09/2016 13:02:12 Source: Collection Source(1/1) switched to FINISHED
05/09/2016 13:02:13 Map(1/2) switched to FINISHED
05/09/2016 13:02:13 Map(2/2) switched to FINISHED
data after union: {"TYPE": "controlMessage1"}
data before union: {"FIRSTNAME": "Person2", "LASTNAME": "Person2: lastname"}
data after union: {"TYPE": "controlMessage1"}
data before union: {"FIRSTNAME": "Person1", "LASTNAME": "Person1: lastname"}
data after union: {"TYPE": "controlMessage2"}
data after union: {"TYPE": "controlMessage2"}
data after union: {"FIRSTNAME": "Person1", "LASTNAME": "Person1"}
data before union: {"FIRSTNAME": "Person4", "LASTNAME": "Person4: lastname"}
data before union: {"FIRSTNAME": "Person3", "LASTNAME": "Person3: lastname"}
data after union: {"FIRSTNAME": "Person2", "LASTNAME": "Person2"}
data after union: {"FIRSTNAME": "Person3", "LASTNAME": "Person3"}
05/09/2016 13:02:13 Map -> Map(2/2) switched to FINISHED
data after union: {"FIRSTNAME": "Person4", "LASTNAME": "Person4"}
05/09/2016 13:02:13 Map -> Map(1/2) switched to FINISHED
05/09/2016 13:02:13 Map(1/2) switched to FINISHED
05/09/2016 13:02:13 Map(2/2) switched to FINISHED
05/09/2016 13:02:13 Job execution switched to status FINISHED.