Apache Spark: connecting to a Spark cluster from a local Jupyter notebook


I am trying to connect to a remote Spark master from a notebook on my local machine.

When I try to create the SparkContext:

sc = pyspark.SparkContext(master="spark://remote-spark-master-hostname:7077",
                          appName="jupyter notebook_test")

I get the following exception:

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    134         try:
    135             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
--> 136                           conf, jsc, profiler_cls)
    137         except:
    138             # If an error occurs, clean up in order to allow future SparkContext creation:

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
    196 
    197         # Create the Java SparkContext through Py4J
--> 198         self._jsc = jsc or self._initialize_context(self._conf._jconf)
    199         # Reset the SparkConf to the one actually used by the SparkContext in JVM.
    200         self._conf = SparkConf(_jconf=self._jsc.sc().conf())

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in _initialize_context(self, jconf)
    304         Initialize SparkContext in function to allow subclass specific initialization
    305         """
--> 306         return self._jvm.JavaSparkContext(jconf)
    307 
    308     @classmethod

/opt/.venv/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1523         answer = self._gateway_client.send_command(command)
   1524         return_value = get_return_value(
-> 1525             answer, self._gateway_client, None, self._fqn)
   1526 
   1527         for temp_arg in temp_args:

/opt/.venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:516)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)

At the same time, I can create a Spark context with the same interpreter when I run it in interactive mode.

How can I connect to the remote Spark master from a local Jupyter notebook?

I solved the problem using @HristoIliev's suggestion. In my case, PYSPARK_PYTHON was not set in the Jupyter environment. A simple fix:

import os
os.environ["PYSPARK_PYTHON"] = '/opt/.venv/bin/python'
os.environ["SPARK_HOME"] = '/opt/spark'

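A minimal sketch of the fix, using the same paths as above (adjust them to your own virtualenv and Spark installation; the hostname in the question is a placeholder). The key point is that the variables must be set before pyspark is imported, because the Py4J gateway inherits the environment of the kernel process:

```python
import os

# Paths from the answer above; adjust to your own setup.
os.environ["PYSPARK_PYTHON"] = "/opt/.venv/bin/python"  # Python the executors should use
os.environ["SPARK_HOME"] = "/opt/spark"                  # local Spark installation

# Sanity check: both variables must be visible to the kernel *before*
# `import pyspark`, otherwise the gateway starts without them.
for var in ("PYSPARK_PYTHON", "SPARK_HOME"):
    assert var in os.environ, f"{var} is not set in this kernel"
    print(var, "=", os.environ[var])

# Only after this point:
#   import pyspark
#   sc = pyspark.SparkContext(master="spark://remote-spark-master-hostname:7077",
#                             appName="jupyter notebook_test")
```

The SparkContext creation itself is left commented out here, since it requires a reachable cluster running the same Spark version as the local installation.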
You can also use it for this, but I have not tested it.

This usually happens when PySpark cannot communicate with the master. Make sure the hostname is correct and that you have SPARK_HOME and PYSPARK_PYTHON set correctly in the environment. A mismatch between the local and remote Spark versions can also cause this error.

I know I have the same Spark version (2.4.5) on my workstation and on the cluster. I have already set PYSPARK_PYTHON and SPARK_HOME. That lets me connect to the cluster from plain Python, but I cannot make it work from the notebook, @HristoIliev. Maybe Jupyter needs some special settings?

Print the values of those environment variables both in standalone Python and in the notebook, and look for the differences.

Thank you very much! My notebook environ did not contain SPARK_HOME.

You can also use it to simplify the process: it sets all the environment variables for you.
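The comparison suggested above can be sketched as a small helper. This is only a sketch: the list of variables is an assumption about what matters for PySpark connectivity, so extend it as needed.

```python
import os

# Environment variables that commonly affect connecting PySpark to a remote master.
SPARK_VARS = ("SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PATH")

def spark_env_snapshot(environ=os.environ):
    """Return the Spark-related variables from an environment mapping,
    marking any that are missing."""
    return {var: environ.get(var, "<not set>") for var in SPARK_VARS}

# Run this once in the notebook kernel and once in a standalone interpreter,
# then compare the two printouts line by line.
for var, value in spark_env_snapshot().items():
    print(f"{var} = {value}")
```

In the case described above, a diff of the two snapshots would have shown `SPARK_HOME = <not set>` on the notebook side.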