Import: including an external JAR in PySpark with PyCharm

I ran into a problem while trying to include com.databricks:spark-xml_2.10:0.4.1 in my PySpark code in PyCharm:

import pyspark
from pyspark.shell import sc
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *

import os

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
)



if __name__ == '__main__':
    df = sqlContext.read.format('org.apache.spark.sql.xml') \
        .option('rowTag', 'lei:Extension') \
        .load('C:\\Users\\Consultant\\Desktop\\20170501-gleif-concatenated-file'
              '-lei2.xml')
    df.show()
But it returns:

Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/C:/spark-2.4.5-bin-hadoop2.7/python/dependency
    at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
    at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:221)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
    at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:907)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:907)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "C:/spark-2.4.5-bin-hadoop2.7/python/test.py", line 2, in <module>
    from pyspark.shell import sc
  File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\shell.py", line 38, in <module>
    SparkContext._ensure_initialized()
  File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "C:\spark-2.4.5-bin-hadoop2.7\python\pyspark\java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
I want to add the external JAR directly in PyCharm. Is that possible?


Thanks in advance.

You should set the environment variable as the first step in your script, before anything from pyspark is imported; the JVM gateway reads PYSPARK_SUBMIT_ARGS only once, when the first SparkContext is created:

import os

# The value must end with "pyspark-shell"; otherwise spark-submit
# treats the next token as the application JAR and fails to start.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
)

import pyspark
...
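
Putting it together for the script in the question, a minimal corrected sketch could look like the following. It assumes the same file path, and uses the format name com.databricks.spark.xml, which is what the spark-xml package registers (the question's org.apache.spark.sql.xml is not a registered data source):

import os

# Set before any SparkContext/SparkSession exists: the JVM gateway
# reads PYSPARK_SUBMIT_ARGS exactly once, when it is launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
)

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()

    df = spark.read.format('com.databricks.spark.xml') \
        .option('rowTag', 'lei:Extension') \
        .load('C:\\Users\\Consultant\\Desktop\\20170501-gleif-concatenated-file'
              '-lei2.xml')
    df.show()
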
Then, if you want to do this for every script you run, use PyCharm's Run Configurations. You can add it to the template with these steps:

  • Go to Edit Configurations
  • Under Templates, edit the Python template
  • Add an environment variable:
    PYSPARK_SUBMIT_ARGS="--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell"
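
Alternatively, the package can be requested from code through SparkConf instead of the environment variable; PySpark forwards conf entries to spark-submit when it launches the JVM gateway. A minimal sketch, assuming the same package:

from pyspark import SparkConf, SparkContext

# spark.jars.packages is passed as --conf to spark-submit at gateway
# launch, so it must be set before the first SparkContext is created.
conf = SparkConf().set(
    "spark.jars.packages", "com.databricks:spark-xml_2.10:0.4.1"
)
sc = SparkContext(conf=conf)
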
Hope that helps.