Python: How to link PyCharm with PySpark?

Tags: python, apache-spark, pyspark, pycharm, homebrew

I'm new to apache-spark, and I apparently installed apache-spark with Homebrew on my MacBook:

Last login: Fri Jan  8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.64:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
I want to start playing with it in order to learn more about MLlib. However, I use PyCharm to write scripts in Python. The problem is: when I go to PyCharm and try to call pyspark, PyCharm cannot find the module. I tried adding the path to PyCharm as follows:

Then, from another source, I tried this:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

# Append pyspark  to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
and I still cannot start using PySpark with PyCharm. Any idea how to "link" PyCharm with apache-pyspark?

Update:

Then I searched for the apache-spark and python paths in order to set the environment variables in PyCharm:

apache-spark path:

user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb
python path:

user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *
Then, with the above information, I tried to set the environment variables as follows:

Any idea how to correctly link PyCharm with pyspark?

Then, when I run a python script with the above configuration, I get this exception:

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark

Output:

Configuration 2:

/usr/local/Cellar/apache-spark/1.5.1/libexec 

Output:

From the docs:

To run Spark applications in Python, use the bin/spark-submit script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. You can also use bin/pyspark to launch an interactive Python shell.

You are invoking your script directly with the CPython interpreter, which I think is causing the problem.

Try running your script with:

"${SPARK_HOME}"/bin/spark-submit test_1.py
If it works, you should be able to get it working in PyCharm by setting the project's interpreter to spark-submit.

With the PySpark package (Spark 2.2.0 and later)

Now that PySpark is available as a package, you should be able to simplify the process by installing Spark with pip in the environment you use for PyCharm development:

  • Go to File -> Settings -> Project Interpreter
  • Click on the install button and search for PySpark
  • Click on the Install Package button
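
Equivalently, the same package can be installed from a terminal with that environment's pip, for example:

    pip install pyspark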

Manually, with a user-provided Spark installation

Create a Run configuration:

  • Go to Run -> Edit Configurations
  • Add a new Python configuration
  • Set the Script path so it points to the script you want to execute
  • Edit the Environment variables field so that it contains at least the following (example values are shown after these steps):

    • SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
    • PYTHONPATH - it should contain $SPARK_HOME/python and, optionally, $SPARK_HOME/python/lib/py4j-some-version.src.zip if it is not available otherwise. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3, 0.10.7 - 2.4)

  • Apply the settings
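
For example, for a hypothetical Spark 2.2 installation unpacked under /opt/spark (illustrative paths only; adjust the locations and the Py4J version to your own installation, per the mapping above), the Environment variables field could contain:

    SPARK_HOME=/opt/spark
    PYTHONPATH=/opt/spark/python:/opt/spark/python/lib/py4j-0.10.4-src.zip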

Add the PySpark library to the interpreter path (required for code completion):

  • Go to File -> Settings -> Project Interpreter
  • Open the settings for the interpreter you want to use with Spark
  • Edit the interpreter paths so they include the path to $SPARK_HOME/python (and to Py4J if required)
  • Save the settings

Optionally:

  • Install or add to the path annotations matching the installed Spark version, to get better completion and static error detection (disclaimer - I am an author of that project)

Finally, use the newly created configuration to run your script.

I used the following page as a reference and was able to get pyspark/Spark 1.6.1 (installed via Homebrew) imported in PyCharm 5.

With the steps above, pyspark loads, but I get a gateway error when I try to create a SparkContext. There is some problem with the Spark from Homebrew, so I just grabbed Spark from the Spark website (download the build pre-built for Hadoop 2.6 and later) and pointed at the spark and py4j directories under it. Here is the code in PyCharm that works:

    import os
    import sys
    
    # Path for spark source folder
    os.environ['SPARK_HOME']="/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6"
    
    # Need to Explicitly point to python3 if you are using Python 3.x
    os.environ['PYSPARK_PYTHON']="/usr/local/Cellar/python3/3.5.1/bin/python3"
    
    #You might need to enter your local IP
    #os.environ['SPARK_LOCAL_IP']="192.168.2.138"
    
    #Path for pyspark and py4j
    sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python")
    sys.path.append("/Users/myUser/Downloads/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")
    
    try:
        from pyspark import SparkContext
        from pyspark import SparkConf
        print ("Successfully imported Spark Modules")
    except ImportError as e:
        print ("Can not import Spark Modules", e)
        sys.exit(1)
    
    sc = SparkContext('local')
    words = sc.parallelize(["scala","java","hadoop","spark","akka"])
    print(words.count())
    
I got a lot of help from these instructions, which helped me troubleshoot in PyDev and then get it working in PyCharm.


I'm sure somebody has spent a few hours bashing their head against their monitor trying to get this working, so hopefully this helps save their sanity!

Here is how I solved this on Mac OS X:

  • brew install apache-spark
  • Add this to ~/.bash_profile:

    export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
    export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
    
  • Add pyspark and py4j to the content root (use the correct Spark version):
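
With the Homebrew layout above, the two entries look something like this (the exact version number depends on the release installed):

    /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
    /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip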

Check out this video.

Assume your spark python directory is:

    /home/user/spark/python

Assume your Py4j source is:

    /home/user/spark/python/lib/py4j-0.9-src.zip

Basically, you add the spark python directory and the py4j directory inside it to the interpreter paths. I don't have enough reputation to post a screenshot or I would.

In the video, the user creates a virtual environment within PyCharm itself; however, you can create the virtual environment outside of PyCharm, or activate a pre-existing one, start PyCharm with it, and then add those paths to the virtual environment's interpreter paths from within PyCharm.

I used other methods to add Spark via the bash environment variables, which works great outside of PyCharm, but for some reason it wasn't recognized within PyCharm; this method worked perfectly.
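
A minimal sanity check, using the example paths from this answer (adjust them to your own installation), that can be run with the interpreter configured in PyCharm:

    import sys

    # Example locations from the answer above; replace them with your own Spark layout.
    sys.path.append("/home/user/spark/python")
    sys.path.append("/home/user/spark/python/lib/py4j-0.9-src.zip")

    try:
        import pyspark
        import py4j
        print("pyspark found at: " + pyspark.__file__)
        print("py4j found at: " + py4j.__file__)
    except ImportError as e:
        print("Spark modules are still not importable: " + str(e))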

I followed some tutorials online and added the environment variables to .bashrc:

    # add pyspark to python
    export SPARK_HOME=/home/lolo/spark-1.6.1
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
    
Then I just got the values of SPARK_HOME and PYTHONPATH to use in PyCharm:

    (srz-reco)lolo@K:~$ echo $SPARK_HOME 
    /home/lolo/spark-1.6.1
    (srz-reco)lolo@K:~$ echo $PYTHONPATH
    /home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/home/lolo/spark-1.6.1/python/lib/py4j-0.9-src.zip:/home/lolo/spark-1.6.1/python/:/python/lib/py4j-0.8.2.1-src.zip:/python/:
    
Then I copied them into Run/Debug Configurations -> Environment variables of the script.

Another approach is to add the Py4J and Spark Python directories to PYTHONPATH (Windows and Unix forms respectively; the {py4j} and {spark python} placeholders stand for the actual directories):

    PYTHONPATH=%PYTHONPATH%;{py4j};{spark python}

    export PYTHONPATH=${PYTHONPATH};{py4j};{spark/python}

and then to configure the interpreter paths and check the run configuration's environment variables from the PyCharm menus:

    File menu - settings - project interpreter - (gear shape) - more - (tree below funnel) - (+) - [add the python folder from the spark installation and then py4j-*.zip] - click OK

    Run menu - edit configurations - environment variables - [...] - show
    
If PyCharm uses a conda environment, PySpark can also be installed with conda, optionally pinned to a specific version:

    conda install pyspark

    conda install pyspark=2.2.0