Apache Spark: integrating PySpark with a Jupyter notebook
Below I install Jupyter Notebook and PySpark and integrate the two. When I got to the step of creating a "Jupyter profile", I read that Jupyter profiles no longer exist, so I proceeded with the following lines:
$ mkdir -p ~/.ipython/kernels/pyspark
$ touch ~/.ipython/kernels/pyspark/kernel.json
I opened kernel.json and wrote the following:
{
  "display_name": "pySpark",
  "language": "python",
  "argv": [
    "/usr/bin/python",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7",
    "PYTHONPATH": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python:/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
  }
}
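As a side note, the `IPython.kernel` entry point used above was deprecated in favor of the `ipykernel` module, and `argv` must point at a Python interpreter that actually has that module installed. A sketch of a corrected spec, assuming a Homebrew Python at `/usr/local/bin/python2.7` (that path is an assumption; adjust it to an interpreter that has ipykernel):

```json
{
  "display_name": "pySpark",
  "language": "python",
  "argv": [
    "/usr/local/bin/python2.7",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7",
    "PYTHONPATH": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python:/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
  }
}
```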
The Spark path is correct. However, when I run jupyter console --kernel pyspark, I get the following output:
MacBook:~ Agus$ jupyter console --kernel pyspark
/usr/bin/python: No module named IPython
Traceback (most recent call last):
File "/usr/local/bin/jupyter-console", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/jupyter_core/application.py", line 267, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 595, in launch_instance
app.initialize(argv)
File "<decorator-gen-113>", line 2, in initialize
File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 74, in catch_config_error
return method(app, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 137, in initialize
self.init_shell()
File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 110, in init_shell
client=self.kernel_client,
File "/usr/local/lib/python2.7/site-packages/traitlets/config/configurable.py", line 412, in instance
inst = cls(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 251, in __init__
self.init_kernel_info()
File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 305, in init_kernel_info
raise RuntimeError("Kernel didn't respond to kernel_info_request")
RuntimeError: Kernel didn't respond to kernel_info_request
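The first line of the output (`/usr/bin/python: No module named IPython`) is the real failure: the kernel spec launches `/usr/bin/python`, which has no IPython installed, so the kernel process dies before it can answer `kernel_info_request`. A quick way to check which interpreter can import a given module (a minimal sketch; `has_module` is a hypothetical helper, not part of Jupyter):

```python
import subprocess
import sys

def has_module(python_path, module):
    """Return True if the interpreter at `python_path` can import `module`."""
    proc = subprocess.run(
        [python_path, "-c", "import " + module],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # A zero exit code means the import succeeded in that interpreter.
    return proc.returncode == 0

# Point this at the interpreter from kernel.json, e.g. "/usr/bin/python";
# here we just probe the interpreter running this script:
print(has_module(sys.executable, "IPython"))
```

If this prints False for the interpreter named in `kernel.json`, either install ipykernel into that Python or change `argv` to a Python that has it.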
The easiest way is to use findspark. First create an environment variable:
export SPARK_HOME="{full path to Spark}"
Then install findspark:
pip install findspark
Then launch jupyter notebook, and the following should work:
import findspark
findspark.init()
import pyspark
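Under the hood, `findspark.init()` essentially prepends Spark's Python sources and the bundled py4j zip to `sys.path` so that `import pyspark` resolves. A rough, simplified sketch of that behaviour (not the real findspark implementation; `init_spark_paths` is a hypothetical name):

```python
import glob
import os
import sys

def init_spark_paths(spark_home):
    # Mimic the core of findspark.init(): expose $SPARK_HOME/python
    # and the py4j source zip shipped inside it on sys.path.
    os.environ["SPARK_HOME"] = spark_home
    py_dir = os.path.join(spark_home, "python")
    py4j_zips = glob.glob(os.path.join(py_dir, "lib", "py4j-*-src.zip"))
    paths = [py_dir] + py4j_zips
    sys.path[:0] = paths  # prepend so these entries win over other installs
    return paths
```

This is also why setting `SPARK_HOME` first matters: findspark uses it to locate the Spark installation.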
There are multiple ways to integrate PySpark with a Jupyter notebook.
1. Install Apache Toree:
pip install jupyter
pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
You can verify the installation with:
jupyter kernelspec list
You will get an entry for the Toree PySpark kernel:
apache_toree_pyspark /home/pauli/.local/share/jupyter/kernels/apache_toree_pyspark
After that, if you like, you can install other interpreters such as Scala, SparkR, and SQL:
jupyter toree install --interpreters=Scala,SparkR,SQL
2. Add these lines to your .bashrc:
export SPARK_HOME=/path/to/spark-2.2.0
export PATH="$PATH:$SPARK_HOME/bin"
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Type pyspark in the terminal, and it will open a Jupyter notebook with a SparkContext already initialized.
3. Install PySpark itself with pip:
pip install pyspark
Now you can import pyspark just like any other Python package.
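To confirm that the pip-installed package is actually visible to the interpreter you are running, a minimal check using only the standard library:

```python
import importlib.util

# find_spec returns None when a package cannot be imported.
spec = importlib.util.find_spec("pyspark")
if spec is None:
    print("pyspark is not importable; run: pip install pyspark")
else:
    print("pyspark found at", spec.origin)
```

If this reports the package as missing while pip says it is installed, you are likely running a different interpreter than the one pip installed into.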