Apache Flink: Flink Python API java.io.EOFException


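I am writing three batch applications with the Python API. However, my third application runs into exceptions, especially when I increase the parallelism. This application contains a cross transformation.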

The cluster has 4 virtual machines, each with 4 CPU cores and 7 GB of RAM. The maximum parallelism is therefore set to 16.
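For readers unfamiliar with the operator, a cross transformation pairs every element of one data set with every element of the other, so its output grows with the product of the two input sizes. The snippet below is only a plain-Python sketch of that behaviour and of the slot arithmetic above; it does not use the Flink API, and the names `cross`, `left`, and `right` are illustrative.

```python
from itertools import product


def cross(left, right):
    """Pair every element of `left` with every element of `right`."""
    return list(product(left, right))


left = ["a", "b", "c"]
right = [1, 2]
pairs = cross(left, right)

# The output size is len(left) * len(right).
print(len(pairs))

# Slots on the cluster described above: 4 VMs with 4 cores each.
max_parallelism = 4 * 4
print(max_parallelism)
```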

The exception is:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
    at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)  
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)  
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)  
    at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)  
    at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)  
    at org.apache.flink.python.api.PythonPlanBinder.runPlan(PythonPlanBinder.java:149)  
    at org.apache.flink.python.api.PythonPlanBinder.main(PythonPlanBinder.java:114)  
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)  
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)  
    at java.lang.reflect.Method.invoke(Method.java:498)  
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)  
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)  
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:339)  
    at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:831)  
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:256)  
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1073)  
    at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120)  
    at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117)  
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)  
    at java.security.AccessController.doPrivileged(Native Method)  
    at javax.security.auth.Subject.doAs(Subject.java:422)  
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)  
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)  
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1116)  
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:900)  
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843)  
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843)  
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)  
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)  
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)  
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)  
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)  
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)  
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)  
    at org.apache.flink.python.api.streaming.data.PythonStreamer.streamBufferWithoutGroups(PythonStreamer.java:252)    
    at org.apache.flink.python.api.functions.PythonMapPartition.mapPartition(PythonMapPartition.java:54)  
    at org.apache.flink.runtime.operators.MapPartitionDriver.run(MapPartitionDriver.java:103)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:490)  
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)  
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)  
    at java.lang.Thread.run(Thread.java:745)

I have some evidence:
1) The failure is not deterministic at all; the application does eventually finish without errors.

2)

Which version of Flink are you using? The Python version would also be interesting. This exception means that the Python process neither shut down cleanly nor failed with an error, which is odd. I'll try to reproduce this tomorrow. If you want to debug it yourself, you can modify the catch block in the PythonStreamer#streamBufferWithoutGroups method to catch all exceptions. That way, if an error occurs in the Python process, it should be included in the exception message.

@ChesnaySchepler I am using the latest stable Flink version: 1.2.0. Hadoop: 2.7.3 and Python: 3.5.2.

@ChesnaySchepler If you want to access the raw TaskManager logs, I uploaded them to: My application's source code is named CalculateSimilarity.py and is available from git: Maybe my Python code raises an exception that I am not catching? I will improve the exception handling at the Python level of the application. If that doesn't work, I will try to build Flink from source and modify the Java PythonStreamer.

The entire user-defined function execution is wrapped in a try-except block; all exceptions should be handled by Flink. But the error detection/reporting is a bit fragile :/
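The "improve exception handling at the Python level" idea could be sketched roughly as below. This is a generic decorator, not part of the Flink Python API: `log_exceptions` and `similarity` are hypothetical names, and it assumes the worker's stderr ends up somewhere readable. It prints the full traceback before re-raising, so the failure is still reported upstream.

```python
import functools
import sys
import traceback


def log_exceptions(func):
    """Print the full traceback to stderr, then re-raise the exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            traceback.print_exc(file=sys.stderr)
            sys.stderr.flush()  # flush in case the process dies right after
            raise
    return wrapper


@log_exceptions
def similarity(a, b):
    # Hypothetical user function (Jaccard similarity); it raises
    # ZeroDivisionError when both inputs are empty.
    return len(set(a) & set(b)) / len(set(a) | set(b))
```

Re-raising rather than swallowing the error keeps the behaviour unchanged apart from the extra log output. This only helps when the Python code raises an exception; it will not show anything if the process is killed from outside.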