Python: Using GraphFrames with PyCharm

Tags: python, installation, pycharm, pyspark, graphframes

I have spent almost two days searching the internet, but I cannot solve this problem. I am trying to install graphframes (version: 0.2.0-spark2.0-s_2.11) so it runs with Spark from PyCharm, but despite my best efforts it has been impossible.

I have tried almost everything. Note that before posting I also checked this site for existing answers.

Here is the code I am trying to run:

# IMPORT OTHER LIBS --------------------------------------------------------
import os
import sys
import pandas as pd

# IMPORT SPARK ------------------------------------------------------------------------------------#
# Path to Spark source folder
USER_FILE_PATH = "/Users/<username>"
SPARK_PATH = "/PycharmProjects/GenesAssociation"
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7"
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE
os.environ['SPARK_HOME'] = SPARK_HOME

# Append pySpark to Python Path
sys.path.append(SPARK_HOME + "/python")
sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    from pyspark.graphframes import GraphFrame

except ImportError as ex:
    print "Can not import Spark Modules", ex
    sys.exit(1)

# GLOBAL VARIABLES ----------------------------------------------------------------------------------#
SC = SparkContext('local')
SQL_CONTEXT = SQLContext(SC)

# MAIN CODE ---------------------------------------------------------------------------------------#
if __name__ == "__main__":

    # Main Path to CSV files
    DATA_PATH = '/PycharmProjects/GenesAssociation/data/'
    FILE_NAME = 'gene_gene_associations_50k.csv'

    # LOAD DATA CSV USING  PANDAS -----------------------------------------------------------------#
    print "STEP 1: Loading Gene Nodes -------------------------------------------------------------"
    # Read csv file and load as df
    GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)

    # Concatenate chunks into list & convert to dataFrame
    GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True))

    # Remove duplicates
    GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first')

    # Name Columns
    GENES_DF_CLEAN.columns = ['gene_id']

    # Output dataFrame
    print GENES_DF_CLEAN

    # Create vertices
    VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN)

    # Show some vertices
    print VERTICES.take(5)

    print "STEP 2: Loading Gene Edges -------------------------------------------------------------"
    # Read csv file and load as df
    EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)

    # Concatenate chunks into list & convert to dataFrame
    EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True))

    # Name Columns
    EDGES_DF.columns = ["src", "dst", "rel_type"]

    # Output dataFrame
    print EDGES_DF

    # Create edges
    EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF)

    # Show some edges
    print EDGES.take(5)

    g = gf.GraphFrame(VERTICES, EDGES)
Needless to say, I have already tried copying the graphframes directory into Spark's pyspark directory (see the import above), but that does not seem to be enough... everything else I tried failed as well. Any help would be greatly appreciated. The error message I get is shown below:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040.     Attempting port 4041.

STEP 1: Loading Gene Nodes -------------------------------------------------------------
         gene_id
0         MAP2K4
1           MYPN
2          ACVR1
3          GATA2
4           RPA2
5           ARF1
6           ARF3
8           XRN1
9            APP
10         APLP1
11        CITED2
12         EP300
13          APOB
14         ARRB2
15         CSF1R
16        PRRC2A
17          LSM1
18        SLC4A1
19          BCL3
20         ADRB1
21         BRCA1
25         ARVCF
26         PCBD1
27         PSEN2
28         CAPN3
29         ITPR1
30         MAGI1
31           RB1
32        TSG101
33          ORC1
...          ...
49379      WDR26
49380      WDR5B
49382       NLE1
49383      WDR12
49385      WDR53
49386      WDR59
49387      WDR61
49409       CHD6
49422      DACT1
49424      KMT2B
49438    SMARCA1
49459    DCLRE1A
49469      F2RL1
49472      SENP8
49475      TSPY1
49479   SERPINB5
49521     HOXA11
49548       SYF2
49553      FOXN3
49557      MLANA
49608     REPIN1
49609       GMNN
49670  HIST2H2BE
49767      BCL7C
49797      SIRT3
49810       KLF4
49858        RHO
49896     MAGEA2
49907   SUV420H2
49958     SAP30L

[6025 rows x 1 columns]
16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB.
[Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')]
STEP 2: Loading Gene Edges -------------------------------------------------------------
           src       dst                  rel_type
0       MAP2K4      FLNC                Two-hybrid
1         MYPN     ACTN2                Two-hybrid
2        ACVR1      FNTA                Two-hybrid
3        GATA2       PML                Two-hybrid
4         RPA2     STAT3                Two-hybrid
5         ARF1      GGA3                Two-hybrid
6         ARF3    ARFIP2                Two-hybrid
7         ARF3    ARFIP1                Two-hybrid
8         XRN1     ALDOA                Two-hybrid
9          APP    APPBP2                Two-hybrid
10       APLP1      DAB1                Two-hybrid
11      CITED2    TFAP2A                Two-hybrid
12       EP300    TFAP2A                Two-hybrid
13        APOB      MTTP                Two-hybrid
14       ARRB2    RALGDS                Two-hybrid
15       CSF1R      GRB2                Two-hybrid
16      PRRC2A      GRB2                Two-hybrid
17        LSM1      NARS                Two-hybrid
18      SLC4A1  SLC4A1AP                Two-hybrid
19        BCL3     BARD1                Two-hybrid
20       ADRB1     GIPC1                Two-hybrid
21       BRCA1      ATF1                Two-hybrid
22       BRCA1      MSH2                Two-hybrid
23       BRCA1     BARD1                Two-hybrid
24       BRCA1      MSH6                Two-hybrid
25       ARVCF     CDH15                Two-hybrid
26       PCBD1   CACNA1C                Two-hybrid
27       PSEN2     CAPN1                Two-hybrid
28       CAPN3       TTN                Two-hybrid
29       ITPR1       CA8                Two-hybrid
...        ...       ...                       ...
49969    SAP30     HDAC3  Affinity Capture-Western
49970    BRCA1     RBBP8           Co-localization
49971    BRCA1     BRCA1      Biochemical Activity
49972      SET     TREX1           Co-purification
49973      SET     TREX1     Reconstituted Complex
49974   PLAGL1     EP300     Reconstituted Complex
49975   PLAGL1    CREBBP     Reconstituted Complex
49976    EP300    PLAGL1  Affinity Capture-Western
49977     MTA1      ESR1     Reconstituted Complex
49978    SIRT2     EP300  Affinity Capture-Western
49979    EP300     SIRT2  Affinity Capture-Western
49980    EP300     HDAC1  Affinity Capture-Western
49981    EP300     SIRT2      Biochemical Activity
49982    MIER1    CREBBP     Reconstituted Complex
49983  SMARCA4     SIN3A  Affinity Capture-Western
49984  SMARCA4     HDAC2  Affinity Capture-Western
49985     ESR1     NCOA6  Affinity Capture-Western
49986     ESR1     TOP2B  Affinity Capture-Western
49987     ESR1     PRKDC  Affinity Capture-Western
49988     ESR1     PARP1  Affinity Capture-Western
49989     ESR1     XRCC5  Affinity Capture-Western
49990     ESR1     XRCC6  Affinity Capture-Western
49991    PARP1     TOP2B  Affinity Capture-Western
49992    PARP1     PRKDC  Affinity Capture-Western
49993    PARP1     XRCC5  Affinity Capture-Western
49994    PARP1     XRCC6  Affinity Capture-Western
49995    SIRT3     XRCC6  Affinity Capture-Western
49996    SIRT3     XRCC6     Reconstituted Complex
49997    SIRT3     XRCC6      Biochemical Activity
49998    HDAC1      PAX3  Affinity Capture-Western

[49999 rows x 3 columns]
16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB.
[Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')]
Traceback (most recent call last):
  File "/Users/username/PycharmProjects/GenesAssociation/__init__.py", line 99, in <module>
    g = gf.GraphFrame(VERTICES, EDGES)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 62, in __init__
    self._jvm_gf_api = _java_api(self._sc)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 34, in _java_api
    return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)


Process finished with exit code 1
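The root cause is visible at the bottom of the traceback: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI. GraphFrames is a Spark package, so copying its Python sources into pyspark is not enough; the JVM also needs the GraphFrames jar on its classpath. The cleanest fix is to let Spark fetch the package itself, by setting PYSPARK_SUBMIT_ARGS before the JVM is launched: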
# Must be set before any SparkContext / JVM is created
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)
spark = SparkSession.builder.getOrCreate()
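
In PyCharm you can equivalently set PYSPARK_SUBMIT_ARGS as an environment variable in the run configuration instead of in code. A complete, self-contained script then looks like this:
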
import os
import sys

SPARK_HOME = ...
os.environ["SPARK_HOME"] = SPARK_HOME
# os.environ["PYSPARK_SUBMIT_ARGS"] = ...  # uncomment if not set in the PyCharm run configuration

sys.path.append(os.path.join(SPARK_HOME, "python"))
sys.path.append(os.path.join(SPARK_HOME, "python/lib/py4j-0.10.3-src.zip"))

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal vertex and edge DataFrames; GraphFrame expects an "id" column
# on vertices and "src"/"dst" columns on edges
v = spark.createDataFrame([("a", "foo"), ("b", "bar")], ["id", "attr"])
e = spark.createDataFrame([("a", "b", "foobar")], ["src", "dst", "rel"])


# This import only works once the graphframes package is on the classpath
from graphframes import *

g = GraphFrame(v, e)
g.inDegrees.show()

spark.stop()
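
Once the package resolves, from graphframes import * succeeds and g.inDegrees.show() prints the in-degree of vertex b instead of dying with the ClassNotFoundException above.

As a variant (my addition, not part of the original answer), the package can also be requested through the spark.jars.packages property on the session builder. This is a minimal sketch, assuming it runs before any JVM exists, i.e. for the first SparkSession created in the Python process:

from pyspark.sql import SparkSession

# Only honored while no JVM is running yet, i.e. for the first
# SparkSession created in this Python process.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "graphframes:graphframes:0.2.0-spark2.0-s_2.11")
    .getOrCreate()
)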