Apache Flink: duplicate keys in the output

apache-flink

I am trying out Apache Flink and, to test what I have learned, I am playing with the classic word count problem.

Here is my code:

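import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

public class TestWordCount {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        DataStreamSource<String> addSource = env.addSource(new TestSource());

        // Split each line into (word, 1) pairs, key by the word (field 0),
        // and keep a running sum of the counts (field 1).
        DataStream<Tuple2<String, Integer>> sum = addSource
        .flatMap(new Tokenizer())
        .keyBy(0)
        .sum(1);

        sum.print();
        env.execute();
    }

}

class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {

    private static final long serialVersionUID = 1L;

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
        // Emit one (lower-cased word, 1) pair per space-separated token.
        for (String part : value.split(" ")) {
            out.collect(new Tuple2<>(part.toLowerCase(), 1));
        }
    }
}

class TestSource implements SourceFunction<String> {

    private static final long serialVersionUID = 1L;
    String s = "Hadoop is the Elephant King! A yellow and elegant thing. He never forgets. The Useful data, or lets An extraneous element cling!";

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Emit the whole sentence as a single record, then finish.
        ctx.collect(s);
    }

    @Override
    public void cancel() {
    }
}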

When I run it, the output is as follows:

(hadoop,1)
(is,1)
(the,1)
(elephant,1)
(king!,1)
(a,1)
(yellow,1)
(and,1)
(elegant,1)
(thing.,1)
(he,1)
(never,1)
(forgets.,1)
(the,2)
(useful,1)
(data,,1)
(or,1)
(lets,1)
(an,1)
(extraneous,1)
(element,1)
(cling!,1)

I am just curious why "the" shows up twice, as (the,1) and (the,2).

Any help is much appreciated.
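Why would it appear twice?
I believe "the" was sent twice. (the,1) is the count when the first "the" arrives, and (the,2) is the count when the second "the" arrives.
The sum aggregates the data and emits an output every time an element is received.
When processing a data stream, the input is unbounded, so it is not possible to wait until the "end" to print the result. The notion of a "final report" is meaningless there. What you get instead is a continuously updated stream of results.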
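
To make the running-aggregate behaviour concrete, here is a minimal plain-Java sketch that imitates what a keyed running sum does: it keeps one counter per key and emits the updated count for every incoming element. It is only a simulation under stated assumptions, not Flink code; the class name RunningWordCount and the method runningCounts are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for keyBy(word).sum(count); it does not use the Flink API.
public class RunningWordCount {

    // Returns one "(word,count)" record per input token, in arrival order.
    static List<String> runningCounts(String line) {
        Map<String, Integer> counts = new HashMap<>(); // per-key state
        List<String> emitted = new ArrayList<>();
        for (String part : line.split(" ")) {
            String key = part.toLowerCase();
            // Update the state for this key and emit the new value right away,
            // without waiting for the rest of the input.
            int updated = counts.merge(key, 1, Integer::sum);
            emitted.add("(" + key + "," + updated + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        for (String record : runningCounts("the Elephant King! The Useful data")) {
            System.out.println(record);
        }
    }
}
```

Because a record is emitted on every update, a key that occurs twice in the input produces two records, the second one carrying the updated count. A consumer that only wants the latest count per key can keep the most recent record for each key and discard the older ones.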