Apache Flink union operator gives wrong response


I am applying the union operator to two DataStreams of type GenericRecord.

package com.gslab.com.dataset;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkBroadcast {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        List<String> controlMessageList = new ArrayList<String>();
        controlMessageList.add("controlMessage1");
        controlMessageList.add("controlMessage2");

        List<String> dataMessageList = new ArrayList<String>();
        dataMessageList.add("Person1");
        dataMessageList.add("Person2");
        dataMessageList.add("Person3");
        dataMessageList.add("Person4");

        DataStream<String> controlMessageStream = env.fromCollection(controlMessageList);
        DataStream<String> dataMessageStream = env.fromCollection(dataMessageList);

        // Wrap each control message in a GenericRecord with a single TYPE field
        DataStream<GenericRecord> controlMessageGenericRecordStream = controlMessageStream.map(new MapFunction<String, GenericRecord>() {
            @Override
            public GenericRecord map(String value) throws Exception {
                Record gr = new GenericData.Record(new Schema.Parser().parse(new File("src/main/resources/controlMessageSchema.avsc")));
                gr.put("TYPE", value);
                return gr;
            }
        });

        // Wrap each data message in a GenericRecord with FIRSTNAME and LASTNAME fields
        DataStream<GenericRecord> dataMessageGenericRecordStream = dataMessageStream.map(new MapFunction<String, GenericRecord>() {
            @Override
            public GenericRecord map(String value) throws Exception {
                Record gr = new GenericData.Record(new Schema.Parser().parse(new File("src/main/resources/dataMessageSchema.avsc")));
                gr.put("FIRSTNAME", value);
                gr.put("LASTNAME", value + ": lastname");
                return gr;
            }
        });

        // Displaying generic records
        dataMessageGenericRecordStream.map(new MapFunction<GenericRecord, GenericRecord>() {
            @Override
            public GenericRecord map(GenericRecord value) throws Exception {
                System.out.println("data before union: " + value);
                return value;
            }
        });

        controlMessageGenericRecordStream.broadcast().union(dataMessageGenericRecordStream).map(new MapFunction<GenericRecord, GenericRecord>() {
            @Override
            public GenericRecord map(GenericRecord value) throws Exception {
                System.out.println("data after union: " + value);
                return value;
            }
        });

        env.execute("stream");
    }
}
Output:

05/09/2016 13:02:13 Map(2/2) switched to FINISHED 
data after union: {"TYPE": "controlMessage1"}
data before union: {"FIRSTNAME": "Person2", "LASTNAME": "Person2: lastname"}
data after union: {"TYPE": "controlMessage1"}
data before union: {"FIRSTNAME": "Person1", "LASTNAME": "Person1: lastname"}
data after union: {"TYPE": "controlMessage2"}
data after union: {"TYPE": "controlMessage2"}
data after union: {"FIRSTNAME": "Person1", "LASTNAME": "Person1"}
data before union: {"FIRSTNAME": "Person4", "LASTNAME": "Person4: lastname"}
data before union: {"FIRSTNAME": "Person3", "LASTNAME": "Person3: lastname"}
data after union: {"FIRSTNAME": "Person2", "LASTNAME": "Person2"}
data after union: {"FIRSTNAME": "Person3", "LASTNAME": "Person3"}
05/09/2016 13:02:13 Map -> Map(2/2) switched to FINISHED 
data after union: {"FIRSTNAME": "Person4", "LASTNAME": "Person4"}
05/09/2016 13:02:13 Map -> Map(1/2) switched to FINISHED 
05/09/2016 13:02:13 Map(1/2) switched to FINISHED 
05/09/2016 13:02:13 Map(2/2) switched to FINISHED 
05/09/2016 13:02:13 Job execution switched to status FINISHED.

As you can see, the records from dataMessageGenericRecordStream are incorrect after the union: every field value is replaced by the value of the first field.

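I ran into a similar problem with the DataSet API. I was reading some Avro files as GenericRecords and saw this odd behavior. I used this workaround: instead of reading them as GenericRecords, I read them as specific records (for example, MyAvroObject) and then used a map to convert/cast them to GenericRecords.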

I wrote some code to test your use case with the DataSet API, and it works with the workaround above:

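public static void maintest(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(2);

    List<String> queryList1 = new ArrayList<String>();
    queryList1.add("query1");
    queryList1.add("query2");

    List<String> queryList2 = new ArrayList<String>();
    queryList2.add("QUERY1");
    queryList2.add("QUERY2");
    queryList2.add("QUERY3");
    queryList2.add("QUERY4");

    DataSet<String> dataset1 = env.fromCollection(queryList1);
    DataSet<String> dataset2 = env.fromCollection(queryList2);

    // Build specific Avro records first, then cast them to GenericRecord
    DataSet<GenericRecord> genericDS1 = dataset1.map(new MapFunction<String, GenericRecord>() {
        @Override
        public GenericRecord map(String value) throws Exception {
            Query query = Query.newBuilder().setQuery(value).build();
            return (GenericRecord) query;
        }
    });

    DataSet<GenericRecord> genericDS2 = dataset2.map(new MapFunction<String, GenericRecord>() {
        @Override
        public GenericRecord map(String value) throws Exception {
            SearchEngineQuery searchEngineQuery = SearchEngineQuery.newBuilder().setSeQuery(value).build();
            return (GenericRecord) searchEngineQuery;
        }
    });

    genericDS2.map(new MapFunction<GenericRecord, GenericRecord>() {
        @Override
        public GenericRecord map(GenericRecord value) throws Exception {
            System.out.println("DEBUG: data before union: " + value);
            return value;
        }
    });

    genericDS1.union(genericDS2).map(new MapFunction<GenericRecord, GenericRecord>() {
        @Override
        public GenericRecord map(GenericRecord value) throws Exception {
            System.out.println("DEBUG: data after union: " + value);
            return value;
        }
    }).print();
}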
Here Query and SearchEngineQuery are my Avro objects (analogous to your control message list and data message list).

Output:

{"query": "query1"}
{"se_query": "QUERY1"}
{"se_query": "QUERY3"}
{"query": "query2"}
{"se_query": "QUERY2"}
{"se_query": "QUERY4"}

I spent a few days investigating this for a different issue (but one that still involved GenericRecord) and found the root cause and the solution.

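Root cause: in Apache Avro's "Schema.class", the "field" position is transient and is not serialized by Kryo, so it is initialized to position "0" when the record is deserialized inside the Flink pipeline.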
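To make the mechanism concrete, here is a minimal sketch of how a skipped transient position produces exactly the symptom in the question. It is an analogy, not Avro or Kryo code: the Field class, roundTrip and valueOf are hypothetical names, and plain Java serialization stands in for Kryo because it also skips transient fields and leaves them at their default value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientPositionDemo {

    /** Hypothetical stand-in for a schema field; this is NOT the real Avro class. */
    static class Field implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        transient int position; // skipped by the serializer, as in the root cause above

        Field(String name, int position) {
            this.name = name;
            this.position = position;
        }
    }

    /** Serializes and deserializes a field, mimicking a hop between two operators. */
    static Field roundTrip(Field field) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(field);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Field) in.readObject();
        }
    }

    /** Looks a value up by the field's position, as a positional record does. */
    static Object valueOf(Object[] values, Field field) {
        return values[field.position];
    }

    public static void main(String[] args) throws Exception {
        Object[] values = {"Person1", "Person1: lastname"};
        Field lastName = new Field("LASTNAME", 1);

        System.out.println("before: " + valueOf(values, lastName)); // before: Person1: lastname

        // The transient position is not written, so the copy comes back with position 0.
        Field copy = roundTrip(lastName);
        System.out.println("after:  " + valueOf(values, copy));     // after:  Person1
    }
}
```

After the round trip the LASTNAME field points at slot 0, so it reads the FIRSTNAME value, which matches the "data after union" lines in the question.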

See JIRA AVRO-1476, which describes this and specifically mentions Kryo serialization.

This was fixed in Avro 1.7.7.

Solution: Flink must use Avro 1.7.7 (or later). I have verified the fix on my local machine b