Python: Using GraphFrames with PyCharm
I have spent almost two days searching the internet, but I cannot solve this problem. I am trying to install the graphframes package (version: 0.2.0-spark2.0-s_2.11) so that it runs with Spark through PyCharm, but despite my best efforts I have not been able to get it working. I have tried almost everything. Please note that I also checked this site before posting. Here is the code I am trying to run:
# IMPORT OTHER LIBS --------------------------------------------------------
import os
import sys
import pandas as pd
# IMPORT SPARK ------------------------------------------------------------------------------------#
# Path to Spark source folder
USER_FILE_PATH = "/Users/<username>"
SPARK_PATH = "/PycharmProjects/GenesAssociation"
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7"
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE
os.environ['SPARK_HOME'] = SPARK_HOME
# Append pySpark to Python Path
sys.path.append(SPARK_HOME + "/python")
sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip")
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    from pyspark.graphframes import GraphFrame
except ImportError as ex:
    print "Can not import Spark Modules", ex
    sys.exit(1)
# GLOBAL VARIABLES --------------------------------------------------------------------------------#
SC = SparkContext('local')
SQL_CONTEXT = SQLContext(SC)
# MAIN CODE ---------------------------------------------------------------------------------------#
if __name__ == "__main__":
    # Main Path to CSV files
    DATA_PATH = '/PycharmProjects/GenesAssociation/data/'
    FILE_NAME = 'gene_gene_associations_50k.csv'
    # LOAD DATA CSV USING PANDAS -----------------------------------------------------------------#
    print "STEP 1: Loading Gene Nodes -------------------------------------------------------------"
    # Read csv file and load as df
    GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)
    # Concatenate chunks into list & convert to dataFrame
    GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True))
    # Remove duplicates
    GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first')
    # Name Columns
    GENES_DF_CLEAN.columns = ['gene_id']
    # Output dataFrame
    print GENES_DF_CLEAN
    # Create vertices
    VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN)
    # Show some vertices
    print VERTICES.take(5)
    print "STEP 2: Loading Gene Edges -------------------------------------------------------------"
    # Read csv file and load as df
    EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)
    # Concatenate chunks into list & convert to dataFrame
    EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True))
    # Name Columns
    EDGES_DF.columns = ["src", "dst", "rel_type"]
    # Output dataFrame
    print EDGES_DF
    # Create edges
    EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF)
    # Show some edges
    print EDGES.take(5)
    g = gf.GraphFrame(VERTICES, EDGES)
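The script above hardcodes the py4j archive name (py4j-0.10.1-src.zip), and that name changes between Spark releases. The following is a minimal sketch of looking the archive up instead, assuming the usual layout of a Spark binary distribution (SPARK_HOME/python/lib/py4j-*-src.zip). The helper name spark_python_paths is hypothetical and is not part of PySpark.

```python
import glob
import os


def spark_python_paths(spark_home):
    """Return the sys.path entries PySpark needs for a given SPARK_HOME.

    Assumes the usual layout: <spark_home>/python and
    <spark_home>/python/lib/py4j-<version>-src.zip.
    """
    python_dir = os.path.join(spark_home, "python")
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    if not py4j_zips:
        raise IOError("no py4j archive found matching " + pattern)
    # If several archives are present, take the last one in sorted order.
    return [python_dir, py4j_zips[-1]]
```

It could be used as sys.path.extend(spark_python_paths(SPARK_HOME)) in place of the two sys.path.append calls. This only affects the Python import path; it does not change which JARs the JVM can see.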
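Needless to say, I have already tried copying the graphframes directory (see what I did) into Spark's pyspark directory, but that does not seem to be enough. Everything else I tried failed as well. Any help would be appreciated. The error message I receive is shown below: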
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
STEP 1: Loading Gene Nodes -------------------------------------------------------------
gene_id
0 MAP2K4
1 MYPN
2 ACVR1
3 GATA2
4 RPA2
5 ARF1
6 ARF3
8 XRN1
9 APP
10 APLP1
11 CITED2
12 EP300
13 APOB
14 ARRB2
15 CSF1R
16 PRRC2A
17 LSM1
18 SLC4A1
19 BCL3
20 ADRB1
21 BRCA1
25 ARVCF
26 PCBD1
27 PSEN2
28 CAPN3
29 ITPR1
30 MAGI1
31 RB1
32 TSG101
33 ORC1
... ...
49379 WDR26
49380 WDR5B
49382 NLE1
49383 WDR12
49385 WDR53
49386 WDR59
49387 WDR61
49409 CHD6
49422 DACT1
49424 KMT2B
49438 SMARCA1
49459 DCLRE1A
49469 F2RL1
49472 SENP8
49475 TSPY1
49479 SERPINB5
49521 HOXA11
49548 SYF2
49553 FOXN3
49557 MLANA
49608 REPIN1
49609 GMNN
49670 HIST2H2BE
49767 BCL7C
49797 SIRT3
49810 KLF4
49858 RHO
49896 MAGEA2
49907 SUV420H2
49958 SAP30L
[6025 rows x 1 columns]
16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB.
[Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')]
STEP 2: Loading Gene Edges -------------------------------------------------------------
src dst rel_type
0 MAP2K4 FLNC Two-hybrid
1 MYPN ACTN2 Two-hybrid
2 ACVR1 FNTA Two-hybrid
3 GATA2 PML Two-hybrid
4 RPA2 STAT3 Two-hybrid
5 ARF1 GGA3 Two-hybrid
6 ARF3 ARFIP2 Two-hybrid
7 ARF3 ARFIP1 Two-hybrid
8 XRN1 ALDOA Two-hybrid
9 APP APPBP2 Two-hybrid
10 APLP1 DAB1 Two-hybrid
11 CITED2 TFAP2A Two-hybrid
12 EP300 TFAP2A Two-hybrid
13 APOB MTTP Two-hybrid
14 ARRB2 RALGDS Two-hybrid
15 CSF1R GRB2 Two-hybrid
16 PRRC2A GRB2 Two-hybrid
17 LSM1 NARS Two-hybrid
18 SLC4A1 SLC4A1AP Two-hybrid
19 BCL3 BARD1 Two-hybrid
20 ADRB1 GIPC1 Two-hybrid
21 BRCA1 ATF1 Two-hybrid
22 BRCA1 MSH2 Two-hybrid
23 BRCA1 BARD1 Two-hybrid
24 BRCA1 MSH6 Two-hybrid
25 ARVCF CDH15 Two-hybrid
26 PCBD1 CACNA1C Two-hybrid
27 PSEN2 CAPN1 Two-hybrid
28 CAPN3 TTN Two-hybrid
29 ITPR1 CA8 Two-hybrid
... ... ... ...
49969 SAP30 HDAC3 Affinity Capture-Western
49970 BRCA1 RBBP8 Co-localization
49971 BRCA1 BRCA1 Biochemical Activity
49972 SET TREX1 Co-purification
49973 SET TREX1 Reconstituted Complex
49974 PLAGL1 EP300 Reconstituted Complex
49975 PLAGL1 CREBBP Reconstituted Complex
49976 EP300 PLAGL1 Affinity Capture-Western
49977 MTA1 ESR1 Reconstituted Complex
49978 SIRT2 EP300 Affinity Capture-Western
49979 EP300 SIRT2 Affinity Capture-Western
49980 EP300 HDAC1 Affinity Capture-Western
49981 EP300 SIRT2 Biochemical Activity
49982 MIER1 CREBBP Reconstituted Complex
49983 SMARCA4 SIN3A Affinity Capture-Western
49984 SMARCA4 HDAC2 Affinity Capture-Western
49985 ESR1 NCOA6 Affinity Capture-Western
49986 ESR1 TOP2B Affinity Capture-Western
49987 ESR1 PRKDC Affinity Capture-Western
49988 ESR1 PARP1 Affinity Capture-Western
49989 ESR1 XRCC5 Affinity Capture-Western
49990 ESR1 XRCC6 Affinity Capture-Western
49991 PARP1 TOP2B Affinity Capture-Western
49992 PARP1 PRKDC Affinity Capture-Western
49993 PARP1 XRCC5 Affinity Capture-Western
49994 PARP1 XRCC6 Affinity Capture-Western
49995 SIRT3 XRCC6 Affinity Capture-Western
49996 SIRT3 XRCC6 Reconstituted Complex
49997 SIRT3 XRCC6 Biochemical Activity
49998 HDAC1 PAX3 Affinity Capture-Western
[49999 rows x 3 columns]
16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB.
[Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')]
Traceback (most recent call last):
File "/Users/username/PycharmProjects/GenesAssociation/__init__.py", line 99, in <module>
g = gf.GraphFrame(VERTICES, EDGES)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 62, in __init__
self._jvm_gf_api = _java_api(self._sc)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 34, in _java_api
return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Process finished with exit code 1
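The traceback shows that the JVM cannot find org.graphframes.GraphFramePythonAPI. Copying the Python package into the pyspark directory makes the import work, but the GraphFrames JAR also has to be on Spark's classpath. One way to do that is to set PYSPARK_SUBMIT_ARGS, either in the PyCharm run configuration or in code before the session is created:
import os
from pyspark.sql import SparkSession
# Ask Spark to fetch the GraphFrames package and add its JAR to the classpath.
# This must be set before the first SparkContext / SparkSession is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)
spark = SparkSession.builder.getOrCreate()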
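If PYSPARK_SUBMIT_ARGS may already hold other options, overwriting it would discard them. The sketch below adds a package to an existing value instead. It assumes the value is well formed and contains no quoted arguments with embedded spaces; the helper name add_package is hypothetical and is not part of PySpark.

```python
import os


def add_package(submit_args, coordinate):
    """Return submit_args with a Maven-style coordinate added to --packages.

    Assumes submit_args is well formed and has no quoted arguments
    that contain spaces.
    """
    # Drop the trailing "pyspark-shell" marker; it is re-added at the end.
    tokens = [t for t in submit_args.split() if t != "pyspark-shell"]
    if "--packages" in tokens:
        # Extend the existing comma-separated package list.
        i = tokens.index("--packages") + 1
        packages = tokens[i].split(",")
        if coordinate not in packages:
            packages.append(coordinate)
        tokens[i] = ",".join(packages)
    else:
        tokens += ["--packages", coordinate]
    # When the variable is set by hand, the value has to end with "pyspark-shell".
    tokens.append("pyspark-shell")
    return " ".join(tokens)


# Must run before the first SparkContext / SparkSession is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = add_package(
    os.environ.get("PYSPARK_SUBMIT_ARGS", ""),
    "graphframes:graphframes:0.2.0-spark2.0-s_2.11",
)
```

Calling the helper twice with the same coordinate leaves the value unchanged, so it is safe to keep at the top of a script that is re-run from PyCharm.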
A complete example:
import os
import sys
# Path to your Spark installation (elided in the original)
SPARK_HOME = ...
os.environ["SPARK_HOME"] = SPARK_HOME
# os.environ["PYSPARK_SUBMIT_ARGS"] = ... If not set in PyCharm config
# Make PySpark and its bundled py4j importable
sys.path.append(os.path.join(SPARK_HOME, "python"))
sys.path.append(os.path.join(SPARK_HOME, "python/lib/py4j-0.10.3-src.zip"))
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Vertices need an "id" column; edges need "src" and "dst" columns
v = spark.createDataFrame([("a", "foo"), ("b", "bar")], ["id", "attr"])
e = spark.createDataFrame([("a", "b", "foobar")], ["src", "dst", "rel"])
# Import after the session exists so the package from --packages is available
from graphframes import *
g = GraphFrame(v, e)
g.inDegrees.show()
spark.stop()
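Separately from the classpath error, the example above names the vertex column "id", while the question's vertex DataFrame uses "gene_id". GraphFrame expects an "id" column on the vertices and "src" and "dst" columns on the edges, so the question's frames would likely fail on that check once the JAR is found. A minimal sketch of a pre-flight check follows; the function and constant names are hypothetical and are not part of the graphframes API.

```python
REQUIRED_VERTEX_COLUMNS = ("id",)
REQUIRED_EDGE_COLUMNS = ("src", "dst")


def missing_graph_columns(vertex_columns, edge_columns):
    """Report which columns GraphFrame would expect but are absent.

    Returns a dict with "vertices" and/or "edges" keys listing the
    missing column names; an empty dict means both frames look fine.
    """
    missing = {}
    v = [c for c in REQUIRED_VERTEX_COLUMNS if c not in vertex_columns]
    e = [c for c in REQUIRED_EDGE_COLUMNS if c not in edge_columns]
    if v:
        missing["vertices"] = v
    if e:
        missing["edges"] = e
    return missing
```

With the question's frames this would be called as missing_graph_columns(VERTICES.columns, EDGES.columns). Naming the pandas column 'id' instead of 'gene_id' before calling createDataFrame would address the vertex side.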