Including an external jar in pyspark using PyCharm
I am running into a problem while trying to include com.databricks:spark-xml_2.10:0.4.1 in my pyspark code in PyCharm:
import pyspark
from pyspark.shell import sc
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import os
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
)

if __name__ == '__main__':
    df = sqlContext.read.format('org.apache.spark.sql.xml') \
        .option('rowTag', 'lei:Extension') \
        .load('C:\\Users\\Consultant\\Desktop\\20170501-gleif-concatenated-file'
              '-lei2.xml')
    df.show()
But it returns:
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/C:/spark-2.4.5-bin-hadoop2.7/python/dependency
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:221)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:907)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:907)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "C:/spark-2.4.5-bin-hadoop2.7/python/test.py", line 2, in <module>
from pyspark.shell import sc
File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\shell.py", line 38, in <module>
SparkContext._ensure_initialized()
File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
I would like to add the external jar directly in PyCharm. Is that possible?
Thanks in advance.

You should set the environment variable as the very first step of your script, before pyspark is imported: importing pyspark.shell launches the JVM immediately, so anything assigned to PYSPARK_SUBMIT_ARGS after that point is ignored.
import os

# Must be set before any pyspark import; note that the value has to end
# with "pyspark-shell", otherwise the Java gateway fails to start.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
)

import pyspark
...
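Putting the pieces together, a minimal end-to-end sketch of a corrected script could look like the following. One caveat beyond the environment variable: spark-xml registers its data source as com.databricks.spark.xml (or simply xml), not org.apache.spark.sql.xml as written in the question, so the format string needs adjusting too. The rowTag and file path below are simply taken from the question, and the appName is an arbitrary placeholder.

import os

# Set before any pyspark import so spark-submit picks up the package;
# "pyspark-shell" must stay at the end of the argument string.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
)

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("spark-xml-demo").getOrCreate()
    df = spark.read.format('com.databricks.spark.xml') \
        .option('rowTag', 'lei:Extension') \
        .load('C:\\Users\\Consultant\\Desktop\\'
              '20170501-gleif-concatenated-file-lei2.xml')
    df.show()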
Then, if you want this to apply to every script you run, use PyCharm's run configurations. You can add it to the template by following these steps:
- Go to Edit Configurations
- In Templates, edit the Python template
- Add an environment variable such as
  PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
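As an alternative to the environment variable, the same Maven coordinates can be requested through Spark configuration when the session is built. A minimal sketch of that approach follows (the appName is arbitrary; this only works if no JVM has been started yet, so avoid from pyspark.shell import sc when using it):

from pyspark.sql import SparkSession

# spark.jars.packages asks spark-submit to resolve the Maven coordinates
# at startup, equivalent to passing --packages on the command line.
spark = (
    SparkSession.builder
    .appName("spark-xml-demo")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.10:0.4.1")
    .getOrCreate()
)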