PySpark in an IPython notebook raises Py4JNetworkError

I am running PySpark from an IPython notebook, simply by adding the following to the notebook:

import os
os.chdir('../data_files')
import sys
import pandas as pd
%pylab inline
from IPython.display import Image

# Point the notebook at a local Spark distribution and put its Python
# bindings (and the bundled Py4J) on sys.path before importing pyspark.
os.environ['SPARK_HOME'] = "spark-1.3.1-bin-hadoop2.6"
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.8.2.1-src.zip'))

from pyspark import SparkContext
sc = SparkContext('local')   # local-mode context backed by a JVM gateway
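As a quick sanity check that this bootstrap worked, something like the following should run cleanly (a minimal sketch; the words RDD from the failing project is not shown in the post, so this uses a throwaway parallelized list instead):

# Minimal sanity check for the SparkContext created above.
# 'sc' is the context from the previous cell; the data is a throwaway list.
rdd = sc.parallelize(range(100))
print(rdd.count())                      # expect: 100
print(rdd.map(lambda x: x * 2).sum())   # expect: 9900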
This works fine in one project. But in my second project, after running a few lines (not the same lines every time), I get the following error:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 425, in start
    self.socket.connect((self.address, self.port))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
---------------------------------------------------------------------------
Py4JNetworkError                          Traceback (most recent call last)
<ipython-input-21-4626925bbe8f> in <module>()
----> 1 words.count()

/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.pyc in count(self)
    930         3
    931         """
--> 932         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
    933 
    934     def stats(self):

/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.pyc in sum(self)
    921         6.0
    922         """
--> 923         return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
    924 
    925     def count(self):

/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.pyc in reduce(self, f)
    737             yield reduce(f, iterator, initial)
    738 
--> 739         vals = self.mapPartitions(func).collect()
    740         if vals:
    741             return reduce(f, vals)

/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.pyc in collect(self)
    710         Return a list that contains all of the elements in this RDD.
    711         """
--> 712         with SCCallSiteSync(self.context) as css:
    713             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    714         return list(_load_from_socket(port, self._jrdd_deserializer))

/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/traceback_utils.pyc in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1
     74 

/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
    534             END_COMMAND_PART
    535 
--> 536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
    538                 self.target_id, self.name)

/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in send_command(self, command, retry)
    360          the Py4J protocol.
    361         """
--> 362         connection = self._get_connection()
    363         try:
    364             response = connection.send_command(command)

/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in _get_connection(self)
    316             connection = self.deque.pop()
    317         except Exception:
--> 318             connection = self._create_connection()
    319         return connection
    320 

/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in _create_connection(self)
    323         connection = GatewayConnection(self.address, self.port,
    324                 self.auto_close, self.gateway_property)
--> 325         connection.start()
    326         return connection
    327 

/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in start(self)
    430                 'server'
    431             logger.exception(msg)
--> 432             raise Py4JNetworkError(msg)
    433 
    434     def close(self):

Py4JNetworkError: An error occurred while trying to connect to the Java server
Once this happens, other lines that worked before raise the same error. Any ideas?
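For what it's worth, [Errno 111] Connection refused at the Py4J layer means nothing is listening on the gateway port any more, i.e. the JVM behind the SparkContext has died, which is why every later call fails the same way. A hedged sketch of one way to recover inside the notebook, assuming the dead context is still bound to the name sc (it does not address why the JVM died; check the driver logs for the underlying cause):

# Sketch: tear down the dead context and build a fresh one.
# sc.stop() may itself raise if the JVM is already gone, hence the try/except.
try:
    sc.stop()
except Exception:
    pass
from pyspark import SparkContext
sc = SparkContext('local')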

Specs:

  • pyspark 1.4.1

  • ipython 4.0.0

  • [OSX / Homebrew]

If you want to use pyspark in a Jupyter (formerly IPython) notebook with the IPython kernel, I suggest launching the notebook directly with the pyspark command:

$ pyspark
But for this to work, you need to add three lines to your bash .profile or zsh .zshrc to set these environment variables:

export SPARK_HOME=/path/to/apache-spark/1.4.1/libexec
export PYSPARK_DRIVER_PYTHON=ipython2 # remember that Apache Spark only works with Python 2.7
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
In my case, on OSX with apache-spark installed via Homebrew, this is:

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.4.1/libexec
export PYSPARK_DRIVER_PYTHON=ipython2
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Then, when you run the pyspark command in a terminal, it automatically opens a Jupyter (formerly IPython) notebook in your default browser:

$ pyspark
[I 17:51:00.209 NotebookApp] Serving notebooks from local directory: /Users/Thibault/code/kaggle
[I 17:51:00.209 NotebookApp] 0 active kernels
[I 17:51:00.210 NotebookApp] The IPython Notebook is running at: http://localhost:42424/
[I 17:51:00.210 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 17:51:11.980 NotebookApp] Kernel started: 53ad11b1-4fa4-459d-804c-0487036b0f29
15/09/02 17:51:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
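
In a notebook launched this way, the pyspark driver creates the SparkContext for you, so none of the sys.path/SparkContext bootstrap from the question is needed. A minimal check, assuming the standard pyspark shell setup (which binds the context to the name sc):

# 'sc' is created by the pyspark launcher itself in this setup.
print(sc.version)                          # e.g. '1.4.1'
print(sc.master)                           # master URL the shell was started with
print(sc.parallelize([1, 2, 3]).count())   # expect: 3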