PySpark: motifs in GraphFrames
Tags: pyspark, pyspark-sql, graphframes
I'm new to PySpark and am trying to find motifs in a GraphFrame. The results I get are empty, even though I know that relationships exist between the vertices and edges. I'm running this in Jupyter on Cloudera with Spark 1.6. Screenshots of my vertices, edges, and schema were attached as links. I have been reading the documentation but am not getting it. So far I have the following code. Where am I going wrong?
#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *
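#Create the contexts (skip this if the notebook already provides sc and sqlContext)
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
#Read the csv files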
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")
#Renaming the id columns to enable GraphFrame
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")
#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether I register these DataFrames as temp tables or not, I get the same empty results, so I'm not sure this step is really needed
#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)
print(g)
#this displays the following:
#GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Terminal: int, End Date: string, End Terminal: int, Bike #: int, Subscriber Type: string, Zip Code: string])
#Stations where a is connected to b
motifs = g.find("(a)-[e1]->(b)")
motifs.show()
+---+---+---+
| e1| a| b|
+---+---+---+
+---+---+---+
motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
motifs.show()
+---+---+---+---+
| e1| a| b| e2|
+---+---+---+---+
+---+---+---+---+
motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")
motifs.show()
+---+---+---+---+---+
| e1| a| b| e2| c|
+---+---+---+---+---+
+---+---+---+---+---+
#Stations where a is connected to b and b is connected to c,
#but c is not the same station as a (the filter below only compares c with a)
motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(c)").filter("(c!=a)")
motifs.show()
+---+---+---+---+---+
| e1| a| b| e2| c|
+---+---+---+---+---+
+---+---+---+---+---+
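The problem was how I defined the vertices. I renamed station_ID to id, when in fact the id column had to be name. The printed schema shows why: the vertex id is an int, while the edge src and dst columns are strings holding station names, so no edge endpoint ever matched a vertex id. So this line
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
should be
verticesRDD = verticesRDD.withColumnRenamed("name", "id")
With this change, the motifs work correctly.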
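
To see why the choice of id column matters, here is a minimal pure-Python sketch of the matching that motif finding depends on: an edge can only take part in a motif if its src and dst values appear among the vertex ids. The station and trip rows below are made up for illustration, and matching_edges is a hypothetical helper, not part of GraphFrames.

```python
def matching_edges(vertices, edges, id_column):
    """Return the edges whose src and dst both appear in the chosen vertex id column."""
    ids = {v[id_column] for v in vertices}
    return [e for e in edges if e["src"] in ids and e["dst"] in ids]


# Hypothetical rows shaped like the station and trip files:
# numeric station_ID, and trips that refer to stations by name.
stations = [
    {"station_ID": 1, "name": "Station A"},
    {"station_ID": 2, "name": "Station B"},
    {"station_ID": 3, "name": "Station C"},
]
trips = [
    {"src": "Station A", "dst": "Station B"},
    {"src": "Station B", "dst": "Station C"},
]

# Using the numeric station_ID as the vertex id: no edge endpoint matches.
by_number = matching_edges(stations, trips, "station_ID")
# Using the station name as the vertex id: every edge endpoint matches.
by_name = matching_edges(stations, trips, "name")

print(len(by_number), len(by_name))
```

Running the same kind of check on the real data, for example by joining the edges' src and dst columns against the vertices' id column, is a quick way to confirm that the two DataFrames actually line up before calling g.find.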