PySpark fails in Jupyter after setting PYSPARK_SUBMIT_ARGS


I'm trying to load a Spark (2.2.1) package in a Jupyter notebook that can otherwise run Spark without problems. As soon as I add

%env PYSPARK_SUBMIT_ARGS='--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'
I get the following error when trying to create a context:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-5-b25d0ed9494e> in <module>()
----> 1 sc = SparkContext.getOrCreate()
      2 sql_context = SQLContext(sc)

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in getOrCreate(cls, conf)
    332         with SparkContext._lock:
    333             if SparkContext._active_spark_context is None:
--> 334                 SparkContext(conf=conf or SparkConf())
    335             return SparkContext._active_spark_context
    336 

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    113         """
    114         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    116         try:
    117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    281         with SparkContext._lock:
    282             if not SparkContext._gateway:
--> 283                 SparkContext._gateway = gateway or launch_gateway(conf)
    284                 SparkContext._jvm = SparkContext._gateway.jvm
    285 

/usr/local/spark/spark-2.2.1-bin-without-hadoop/python/pyspark/java_gateway.py in launch_gateway(conf)
     93                 callback_socket.close()
     94         if gateway_port is None:
---> 95             raise Exception("Java gateway process exited before sending the driver its port number")
     96 
     97         # In Windows, ensure the Java child processes do not linger after Python has exited.

Exception: Java gateway process exited before sending the driver its port number

Again, everything works as long as PYSPARK_SUBMIT_ARGS is left unset (or set to just pyspark-shell). As soon as I add anything else (for example, if I set it to --master local pyspark-shell) I get this error. Googling around, most people suggest simply getting rid of PYSPARK_SUBMIT_ARGS, which I can't do for obvious reasons.

I have also tried setting JAVA_HOME, although I don't see why that should make a difference given that Spark works without the environment variable. The arguments I'm passing work with spark-submit and pyspark outside of Jupyter.

I guess my first question is: is there a way to get a more detailed error message? Is there a log file somewhere? The current message tells me essentially nothing.

Set PYSPARK_SUBMIT_ARGS as follows, before initializing the SparkContext:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'
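
With the variable set that way, the context creation from the traceback should succeed, because launch_gateway only reads PYSPARK_SUBMIT_ARGS at the moment the first SparkContext is built. A minimal sketch of the full ordering (same package and class names as in the question; Spark 2.2.x assumed):

import os

# 1. Configure the spark-submit arguments before any SparkContext (and hence
#    any JVM gateway) exists; the value must end with 'pyspark-shell'.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-redshift_2.10:2.0.1 pyspark-shell'
)

# 2. Only now create the context; pyspark's launch_gateway picks up the
#    variable while spawning the Java gateway process.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sql_context = SQLContext(sc)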

Have you tried running it in console mode, i.e. outside of the notebook?

Yes. The same arguments work with spark-submit and pyspark (and spark-shell).

Found the problem: Jupyter was including the quotes in the environment variable. The quotes had to be removed, and then it works.

@lfk I used %env PYSPARK_SUBMIT_ARGS=--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell at the very beginning of my notebook, but I still got the same error I reported: it gives me Error: Missing application resource.
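
Regarding the quoting issue found in the comments above: if the quotes written on the %env line end up stored as part of the value, spark-submit receives a malformed argument list. A quick way to check from inside the notebook (an illustrative sanity check, not from the original thread):

import os

# repr() always wraps the string in one pair of quotes; if the value itself
# begins with an additional quote character, the quotes from the %env line
# were stored literally and should be removed.
print(repr(os.environ.get('PYSPARK_SUBMIT_ARGS')))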