Python pyspark crashes at dataframe.collect() with error message "Py4JNetworkError: An error occurred while trying to connect to the Java server"


I have a function my_function that contains a for loop. Part of the loop needs to call DataFrame collect(). It works for the first few iterations, but it always crashes on the fifth one. Do you know why this happens?

  File "my_code.py", line 189, in my_function
    my_df_collect = my_df.collect()
  File "/lib/spark/python/pyspark/sql/dataframe.py", line 280, in collect
    port = self._jdf.collectToPython()
  File "/lib/spark/python/pyspark/traceback_utils.py", line 78, in __exit__
    self._context._jsc.setCallSite(None)
  File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 811, in __call__
    answer = self.gateway_client.send_command(command)
  File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 624, in send_command
    connection = self._get_connection()
  File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 579, in _get_connection
    connection = self._create_connection()
  File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 585, in _create_connection
    connection.start()
  File "/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 697, in start
    raise Py4JNetworkError(msg, e)
Py4JNetworkError: An error occurred while trying to connect to the Java server
Another error message:

Exception happened during processing of request from ('127.0.0.1', 55584)
Traceback (most recent call last):
  File "/anaconda/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/anaconda/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/anaconda/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/anaconda/lib/python2.7/SocketServer.py", line 655, in __init__
    self.handle()
  File "/lib/spark/python/pyspark/accumulators.py", line 235, in handle
    num_updates = read_int(self.rfile)
  File "/lib/spark/python/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError

Maybe the JVM is running out of memory. Try adding memory to the driver, or run df.take(10) instead of df.collect() to test whether the problem is the amount of data being returned.
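
For reference, a minimal sketch of that suggestion. The input path, DataFrame name, and the 4g value are placeholders; in practice the driver heap is usually set with spark-submit --driver-memory or spark.driver.memory before the SparkContext is created.

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Ask for a larger driver heap before the JVM is launched
    # (equivalent to `spark-submit --driver-memory 4g`).
    conf = SparkConf().set("spark.driver.memory", "4g")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    my_df = sqlContext.read.parquet("/path/to/input")  # placeholder input

    # take(10) only ships 10 rows over the Py4J socket, so if this works
    # while collect() crashes, the problem is the volume of returned data.
    sample_rows = my_df.take(10)
    # all_rows = my_df.collect()   # this is the call that was crashing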


Thanks for the suggestion. In the end I spawn a separate process and instantiate a new SparkContext for each loop iteration. There is a memory leak somewhere that is hard to track down, because my Spark only crashes after a few hundred jobs. You were right! There were too many df.collect() calls in my function, and that is what was crashing Spark.
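
For anyone hitting the same leak, the workaround described in that comment looks roughly like the sketch below. The function body and input path are hypothetical; the point is that each iteration runs in its own process with its own SparkContext, so anything leaked inside one iteration is freed when that process exits.

    from multiprocessing import Process
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    def run_one_iteration(i):
        # A fresh SparkContext (and driver-side JVM) for this process only.
        sc = SparkContext(appName="my_function_iteration_%d" % i)
        try:
            sqlContext = SQLContext(sc)
            my_df = sqlContext.read.parquet("/path/to/input")  # placeholder input
            rows = my_df.take(10)   # keep the data sent back to the driver small
            # ... per-iteration work with `rows` goes here ...
        finally:
            sc.stop()

    if __name__ == "__main__":
        for i in range(200):
            p = Process(target=run_one_iteration, args=(i,))
            p.start()
            p.join()   # run iterations sequentially, each in its own process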