Flink fencing error with Apache Flink in Kubernetes HA mode
I am using Flink 1.12 and trying to run the JobManager in high availability on a Kubernetes cluster (AKS). I am running 2 JobManager and 2 TaskManager pods.
The problem I am facing is that the TaskManagers cannot find the JobManager leader. The cause is that they resolve the JobManager through its Kubernetes Service (a cluster service) instead of the leader's pod IP. As a result, the Service sometimes routes the registration call to the standby JobManager, which leaves the TaskManagers unable to reach the leader.
Below is the content of the JobManager leader file:
{
  "apiVersion": "v1",
  "data": {
    "address": "akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2",
    "sessionId": "ee14c446-82b0-45ab-b470-ee445ddd0e0f"
  },
  "kind": "ConfigMap",
  "metadata": {
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"e6a42a4f-235e-4b97-93c6-40f4b987f56b\",\"leaseDuration\":15.000000000,\"acquireTime\":\"2021-02-16T05:13:37.365000Z\",\"renewTime\":\"2021-02-16T05:22:17.386000Z\",\"leaderTransitions\":105}"
    },
    "creationTimestamp": "2021-02-15T16:13:26Z",
    "labels": {
      "app": "flinktestk8cluster",
      "configmap-type": "high-availability",
      "type": "flink-native-kubernetes"
    },
    "name": "flinktestk8cluster-bc7b6f9aa8b0a111e1c50b10155a85be-jobmanager-leader",
    "namespace": "default",
    "resourceVersion": "46202881",
    "selfLink": "/api/v1/namespaces/default/configmaps/flinktestk8cluster-bc7b6f9aa8b0a111e1c50b10155a85be-jobmanager-leader",
    "uid": "1d5ca6e3-dc7e-4fb7-9fab-c1bbb956cda9"
  }
}
Here, flink-jobmanager is the name of the Kubernetes Service for the JobManager.
Is there a way to fix this? How can I make the JobManager write the pod IP instead of the service name into the leader file?
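For reference, the address the TaskManagers resolve can be read back directly from that leader ConfigMap. This is only a diagnostic sketch; the ConfigMap name and namespace are taken from the JSON above:

```shell
# Print the leader address that TaskManagers will resolve to.
# ConfigMap name and namespace are taken from the leader file shown above.
kubectl -n default get configmap \
  flinktestk8cluster-bc7b6f9aa8b0a111e1c50b10155a85be-jobmanager-leader \
  -o jsonpath='{.data.address}'
# This should print the akka.tcp://flink@flink-jobmanager:6123/... address
# from the "data" section above, confirming the Service name is recorded
# instead of a pod IP.
```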
Here is the exception:
2021-02-12 06:15:53,849 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Registration at ResourceManager failed due to an error
java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(954fe694bb4d268a2e32b4497e944144, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0 because the fencing token is null.
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_275]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_275]
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661) ~[?:1.8.0_275]
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646) ~[?:1.8.0_275]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_275]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_275]
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:235) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_275]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_275]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_275]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_275]
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1044) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.OnComplete.internal(Future.scala:263) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.OnComplete.internal(Future.scala:261) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:101) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:999) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.actor.Actor.aroundReceive(Actor.scala:517) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.actor.Actor.aroundReceive$(Actor.scala:515) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:458) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.actor.ActorCell.invoke(ActorCell.scala:561) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.Mailbox.run(Mailbox.scala:225) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.12-1.12.1.jar:1.12.1]
Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(954fe694bb4d268a2e32b4497e944144, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0 because the fencing token is null.
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:67) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:159) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.actor.Actor.aroundReceive(Actor.scala:517) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.actor.Actor.aroundReceive$(Actor.scala:515) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
... 9 more
2021-02-12 06:15:53,849 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Pausing and re-attempting registration in 10000 ms
The problem is that when using standby JobManagers, each JobManager pod needs a unique address. Hence, you cannot configure the Service through which the components talk to each other. Instead, you should start the JobManager pods with their pod IP as the jobmanager.rpc.address.
In order to start each JobManager pod with its IP, you cannot use a ConfigMap containing the Flink configuration, because it would be identical for every JobManager pod. Instead, you need to add the following snippet to the JobManager deployment:
env:
- name: MY_POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
- name: FLINK_PROPERTIES
  value: |
    jobmanager.rpc.address: ${MY_POD_IP}
This tells every JobManager pod to use its pod IP as the jobmanager.rpc.address, which is also the address written to the K8s HA services. With this, every user of the K8s HA services running inside the K8s cluster can find the current leader.
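As a rough illustration of the mechanism (a sketch of the assumed behavior, not the entrypoint's exact code): the ${MY_POD_IP} placeholder inside FLINK_PROPERTIES is expanded against the pod's environment before the properties land in flink-conf.yaml. The IP and file path below are made up for the demo:

```shell
# Sketch: expand ${MY_POD_IP} in FLINK_PROPERTIES and append the result to the
# config file. This only illustrates the substitution; the real docker-entrypoint
# logic in the official image may differ.
MY_POD_IP=10.244.1.7                       # example pod IP (made up)
FLINK_PROPERTIES='jobmanager.rpc.address: ${MY_POD_IP}'
CONF=$(mktemp)                             # stand-in for /opt/flink/conf/flink-conf.yaml
printf '%s\n' "$FLINK_PROPERTIES" \
  | sed "s|\${MY_POD_IP}|$MY_POD_IP|" >> "$CONF"
cat "$CONF"                                # prints: jobmanager.rpc.address: 10.244.1.7
```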
Next, you need to configure every deployment that should use the K8s HA services. You can do this by extending the FLINK_PROPERTIES env variable:
env:
- name: MY_POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
- name: FLINK_PROPERTIES
  value: |
    jobmanager.rpc.address: ${MY_POD_IP}
    kubernetes.cluster-id: foobar
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: hdfs:///flink/recovery
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
Adding this to the JobManager pod definition, and
env:
- name: FLINK_PROPERTIES
  value: |
    kubernetes.cluster-id: foobar
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: hdfs:///flink/recovery
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
to the TaskManager deployment should solve the problem.
Here is the complete JobManager deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
      - name: jobmanager
        image: flink:1.12.1
        args: ["jobmanager"]
        env:
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: FLINK_PROPERTIES
          value: |
            jobmanager.rpc.address: ${MY_POD_IP}
            kubernetes.cluster-id: foobar
            high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
            high-availability.storageDir: file:///flink/recovery
            restart-strategy: fixed-delay
            restart-strategy.fixed-delay.attempts: 10
        ports:
        - containerPort: 6123
          name: rpc
        - containerPort: 6124
          name: blob-server
        - containerPort: 8081
          name: webui
        livenessProbe:
          tcpSocket:
            port: 6123
          initialDelaySeconds: 30
          periodSeconds: 60
        securityContext:
          runAsUser: 9999 # refers to user _flink_ from official flink image, change if necessary
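Once deployed, one way to sanity-check failover (a sketch; `<leader-pod>` is a placeholder for the actual pod name) is to kill the current leader and confirm that the leader ConfigMap is rewritten with the standby's pod IP:

```shell
# Failover sanity check. Find the JobManager pods via the labels from the
# Deployment above, then delete the current leader (<leader-pod> is a placeholder):
kubectl get pods -l app=flink,component=jobmanager
kubectl delete pod <leader-pod>
# After the lease expires, the leader ConfigMap should carry the standby's pod IP:
kubectl -n default get configmap \
  flinktestk8cluster-bc7b6f9aa8b0a111e1c50b10155a85be-jobmanager-leader \
  -o jsonpath='{.data.address}'
```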
Do I need to remove the flink-conf.yaml ConfigMap entirely, since the properties are set via the FLINK_PROPERTIES env variable? — Yes, that is required. Unfortunately, setting the configuration via FLINK_PROPERTIES and a ConfigMap-backed flink-conf.yaml do not play well together.
So I tried this configuration, but I got /docker-entrypoint.sh: line 95: /opt/flink/conf/flink-conf.yaml: No such file or directory. To work around it I tried creating an empty flink-conf.yaml, but exec-ing into the pod shows that FLINK_PROPERTIES is not appended to the config. Also, echoing FLINK_PROPERTIES shows jobmanager.rpc.address: ${MY_POD_IP}, so my pod IP is never substituted. — Which Docker image exactly are you running? I strongly recommend using this with version 1.12 or 1.11; it should be the correct Dockerfile. The important point is not to use a ConfigMap-backed flink-conf.yaml, but to use the flink-conf.yaml shipped with the Docker image and customize it via FLINK_PROPERTIES.
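Following up on that exchange: to confirm that FLINK_PROPERTIES actually reaches the configuration the process uses, one can inspect the rendered file inside a running pod. This is a sketch; the pod name is a placeholder, and /opt/flink/conf is assumed to be the conf directory of the official flink image:

```shell
# Verify that FLINK_PROPERTIES was appended and ${MY_POD_IP} was expanded.
# <jobmanager-pod> is a placeholder for an actual JobManager pod name.
kubectl exec <jobmanager-pod> -- \
  grep 'jobmanager.rpc.address' /opt/flink/conf/flink-conf.yaml
# Expect a concrete pod IP here, not the literal string ${MY_POD_IP}.
```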