Apache Flink: Flink Python API java.io.EOFException
I am writing three batch applications with the Python API. However, my third application is failing with exceptions, especially when I increase the parallelism. This application contains a cross transformation.

The cluster has 4 virtual machines, each with 4 CPU cores and 7 GB of RAM, so the maximum parallelism is set to 16.

The exception is:
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
at org.apache.flink.python.api.PythonPlanBinder.runPlan(PythonPlanBinder.java:149)
at org.apache.flink.python.api.PythonPlanBinder.main(PythonPlanBinder.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:339)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:831)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:256)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1073)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117)
at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1116)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:900)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.flink.python.api.streaming.data.PythonStreamer.streamBufferWithoutGroups(PythonStreamer.java:252)
at org.apache.flink.python.api.functions.PythonMapPartition.mapPartition(PythonMapPartition.java:54)
at org.apache.flink.runtime.operators.MapPartitionDriver.run(MapPartitionDriver.java:103)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:490)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
at java.lang.Thread.run(Thread.java:745)
I have some evidence:
1) It is completely non-deterministic; the application does eventually complete without errors.
2)

Which version of Flink are you using? The Python version would also be interesting. This exception means that the Python process neither shut down cleanly nor exited with an error, which is odd. I will try to reproduce this tomorrow. If you are willing to debug it yourself, you can modify the catch block in the PythonStreamer#streamBufferWithoutGroups method to catch all exceptions. That way, if an error occurs in the Python process, it should be included in the exception message.

@ChesnaySchepler I am using the latest stable Flink version: 1.2.0. Hadoop: 2.7.3 and Python: 3.5.2.

@ChesnaySchepler If you want access to the raw TaskManager logs, I uploaded them to:
My application's source code is named CalculateSimilarity.py and is available from git:

Maybe my Python code raises some exception that I am not catching? I will improve the exception handling at the Python level of the application. If that does not work, I will try to build Flink from source and modify the Java PythonStreamer.

The entire user-defined function execution is wrapped in a try-except block; all exceptions should be handled by Flink. But the error detection/reporting is a bit fragile :/
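The plan to improve exception handling at the Python level could look roughly like the sketch below. It is plain Python and does not use any Flink API. `report_errors` and `similarity` are hypothetical names introduced only for illustration; they are not taken from CalculateSimilarity.py. The idea is to print the full traceback to stderr before re-raising, so that a failure inside a user-defined function leaves a trace even if the error never makes it into the Java exception message. Whether stderr output from the Python process ends up in the TaskManager logs is an assumption to verify on your setup.

```python
import functools
import sys
import traceback


def report_errors(func):
    """Print the full traceback to stderr, then re-raise the exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            traceback.print_exc(file=sys.stderr)
            # Flush so the message is not lost if the process dies right after.
            sys.stderr.flush()
            raise
    return wrapper


@report_errors
def similarity(pair):
    # Hypothetical stand-in for the body of a user-defined cross function.
    left, right = pair
    return left / right
```

Re-raising rather than swallowing the exception keeps the job failing as before; the only change is that the Python-side cause is written out first.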