Apache Kafka: finish the Flink program after all Kafka consumers reach end of stream
I have set up a minimal example here with N streams from N Kafka topics (100 in the example below). I want each stream to finish when it sees an "EndofStream" message, and the Flink program to exit gracefully once all streams have finished.

This works correctly when parallelism is set to 1, but generally it does not happen: judging from the behavior, not all threads of the Kafka consumer group terminate. It has been suggested to throw an exception instead; however, the program then terminates at the first exception and does not wait for all streams to finish. I have also added a minimal Python producer that writes messages to the Kafka topics, for reproducibility:
// Imports omitted from the original snippet
import java.util.ArrayList;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.FileSystem.WriteMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

String outputPath = "file://" + System.getProperty("user.dir") + "/out/output";

Properties kafkaProps = new Properties();
String brokers = "<IP>:<PORT>";
kafkaProps.setProperty("bootstrap.servers", brokers);
kafkaProps.setProperty("auto.offset.reset", "earliest");

ArrayList<FlinkKafkaConsumer<String>> consumersList = new ArrayList<>();
ArrayList<DataStream<String>> streamList = new ArrayList<>();

for (int i = 0; i < 100; i++) {
    consumersList.add(new FlinkKafkaConsumer<String>(Integer.toString(i),
            new SimpleStringSchema() {
                @Override
                public boolean isEndOfStream(String nextElement) {
                    // Signal end of this stream when the marker message arrives.
                    // throw new RuntimeException("End of Stream");
                    return nextElement.contains("EndofStream");
                }
            }, kafkaProps));
    consumersList.get(i).setStartFromEarliest();
    streamList.add(env.addSource(consumersList.get(i)));
    streamList.get(i).writeAsText(outputPath + Integer.toString(i), WriteMode.OVERWRITE);
}

// execute program
env.execute("Flink Streaming Java API Skeleton");
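The desired behavior is that the job finishes only once every one of the N sources has seen its end marker. Stripped of the Flink API, that coordination can be sketched with a `CountDownLatch`; this is a simplified, hypothetical illustration of the intended semantics, not part of the actual program:

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch: the job should complete only after every
// stream has reported seeing its "EndofStream" marker.
public class EndOfStreamCoordinator {
    private final CountDownLatch remaining;

    public EndOfStreamCoordinator(int numStreams) {
        this.remaining = new CountDownLatch(numStreams);
    }

    // Called once per stream when it observes the end marker.
    public void streamFinished() {
        remaining.countDown();
    }

    // True only once all N streams have finished.
    public boolean allFinished() {
        return remaining.getCount() == 0;
    }
}
```

This is exactly what Flink does internally when every source task switches to FINISHED; the problem in the question is that with parallelism > 1 some source subtasks never report finishing.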
A last resort would be to write each topic to a separate file and use the files as sources instead of the Kafka consumers. The end goal is to measure the time Flink needs to process a specific workload with a given program.

Use the cancel method, which FlinkKafkaConsumer inherits from its parent class:
public void cancel() — Description copied from interface: SourceFunction

Cancels the source. Most sources will have a while loop inside the run(SourceContext) method. The implementation needs to ensure that the source will break out of that loop after this method is called. A typical pattern is to have a "volatile boolean isRunning" flag that is set to false in this method. That flag is checked in the loop condition.

When a source is canceled, the executing thread will also be interrupted (via Thread.interrupt()). The interruption happens strictly after this method has been called, so any interruption handler can rely on the fact that this method has completed. It is good practice to make any flags altered by this method "volatile", in order to guarantee the visibility of the effects of this method to any interruption handler.

Specified by: cancel in interface SourceFunction
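The volatile-flag pattern described in that Javadoc can be sketched without the actual Flink API. The class below is a simplified, hypothetical stand-in (no SourceContext, and the loop is bounded so the sketch terminates on its own), just to show how the flag and the loop condition interact:

```java
// Simplified sketch (not the real Flink SourceFunction API): the
// "volatile boolean isRunning" cancellation pattern from the javadoc.
public class PollingSource {
    private volatile boolean isRunning = true;

    // Stand-in for SourceFunction.run(SourceContext): loops until cancelled.
    public int run() {
        int emitted = 0;
        while (isRunning) {
            emitted++;            // stand-in for sourceContext.collect(...)
            if (emitted >= 10) {  // bounded here only so the sketch terminates
                cancel();
            }
        }
        return emitted;
    }

    // Sets the flag to false; the loop condition in run() observes it,
    // so the source breaks out of its loop after this method is called.
    public void cancel() {
        isRunning = false;
    }
}
```

The `volatile` keyword matters: it guarantees that the flag written by `cancel()` (possibly from another thread) is visible to the loop in `run()`.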
You are right. You must use the SimpleStringSchema. This is based on this answer. Take a look at this example: first I sent the string "Flink code we saw also works in a cluster" and the Kafka consumer consumed the message. Then I sent "SHUTDOWNDDDDDDD", which also had no effect on finishing the stream. Finally, I sent "SHUTDOWN", and the stream job finished. See the logs below the program.
package org.sense.flink.examples.stream.kafka;

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaConsumerQuery {

    public KafkaConsumerQuery() throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("group.id", "test");

        // Subscribe to every topic matching the pattern "test".
        FlinkKafkaConsumer<String> myConsumer = new FlinkKafkaConsumer<>(
                java.util.regex.Pattern.compile("test"),
                new MySimpleStringSchema(), properties);

        DataStream<String> stream = env.addSource(myConsumer);
        stream.print();

        System.out.println("Execution plan >>>\n" + env.getExecutionPlan());
        env.execute(KafkaConsumerQuery.class.getSimpleName());
    }

    private static class MySimpleStringSchema extends SimpleStringSchema {
        private static final long serialVersionUID = 1L;
        private final String SHUTDOWN = "SHUTDOWN";

        @Override
        public String deserialize(byte[] message) {
            return super.deserialize(message);
        }

        @Override
        public boolean isEndOfStream(String nextElement) {
            // Finish this source when the exact SHUTDOWN marker is received.
            if (SHUTDOWN.equalsIgnoreCase(nextElement)) {
                return true;
            }
            return super.isEndOfStream(nextElement);
        }
    }

    public static void main(String[] args) throws Exception {
        new KafkaConsumerQuery();
    }
}
Where is this function called? We don't have access to the isEndOfStream function on the FlinkKafkaConsumer object.

Just extend the class, YourConsumer extends FlinkKafkaConsumer, or call cancel() when you receive the string EndofStream, since cancel() is a method of the superclass.

As you said, cancel() is part of the superclass, but the isEndOfStream check is done against a private variable, so extending the class does not seem like the right idea. The second suggestion, calling cancel() when we receive EndOfStream, is also a challenge, because the FlinkKafkaConsumer object is not part of the stream. I don't know how to pass the consumer object into the stream, but I will look into it. Thanks.

I added an example using SimpleStringSchema and changed the method isEndOfStream().
parallelism.default: 2
cluster.evenly-spread-out-slots: true
2020-07-02 16:39:59,025 INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-8, groupId=test] Discovered group coordinator localhost:9092 (id: 2147483647 rack: null)
3> Flink code we saw also works in a cluster. To run this code in a cluster
3> SHUTDOWNDDDDDDD
2020-07-02 16:40:27,973 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Sink: Print to Std. Out (3/4) (5f47c2b3f55c5eb558484d49fb1fcf0e) switched from RUNNING to FINISHED.
2020-07-02 16:40:27,973 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Custom Source -> Sink: Print to Std. Out (3/4) (5f47c2b3f55c5eb558484d49fb1fcf0e).
2020-07-02 16:40:27,974 INFO org.apache.flink.runtime.taskmanager.Task - Ensuring all FileSystem streams are closed for task Source: Custom Source -> Sink: Print to Std. Out (3/4) (5f47c2b3f55c5eb558484d49fb1fcf0e) [FINISHED]
2020-07-02 16:40:27,975 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FINISHED to JobManager for task Source: Custom Source -> Sink: Print to Std. Out (3/4) 5f47c2b3f55c5eb558484d49fb1fcf0e.
2020-07-02 16:40:27,979 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Sink: Print to Std. Out (3/4) (5f47c2b3f55c5eb558484d49fb1fcf0e) switched from RUNNING to FINISHED.
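The logs above also explain why "SHUTDOWNDDDDDDD" did not finish the job while "SHUTDOWN" did: the answer's schema uses `equalsIgnoreCase`, which requires an exact (case-insensitive) match, whereas the question's schema used `contains`, a substring check. A minimal sketch of the two checks side by side (plain `String` methods, no Flink dependency):

```java
// The two end-of-stream checks used in this thread, isolated for comparison.
public class EndMarkerCheck {
    // Matches the answer's schema: exact, case-insensitive comparison,
    // so "SHUTDOWNDDDDDDD" does NOT end the stream but "shutdown" does.
    public static boolean exactMatch(String element) {
        return "SHUTDOWN".equalsIgnoreCase(element);
    }

    // Matches the question's schema: substring check, so any message
    // containing "EndofStream" anywhere ends the stream.
    public static boolean containsMatch(String element) {
        return element.contains("EndofStream");
    }
}
```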