Error in PySpark GraphFrames PageRank

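I am new to pyspark and am trying to understand how PageRank works. I am using Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (and their schemas) are at the following links: and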

My code is as follows:

#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *

#Read the csv files 
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")

#Renaming the id columns to enable GraphFrame 
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")

#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether I register the RDDs as temp tables or not, I get the same results, so I'm not sure this step is really needed

#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)
Now, when I run the pageRank function:

g.pageRank(resetProbability=0.15, maxIter=10)
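Py4JJavaError: An error occurred while calling o98.run. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 2637, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)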

Py4JJavaError: An error occurred while calling o166.run. : org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID id not contained in GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Terminal: int, End Terminal: int, Bike #: int, Subscriber Type: string, Zip Code: string])

AttributeError: 'function' object has no attribute 'resetProbability'

ranks = g.pageRank(resetProbability=0.15, maxIter=10).run()
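Py4JJavaError: An error occurred while calling o188.run. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 2641, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)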


I have been reading up on this, but I can't figure out what is going wrong. Any help would be greatly appreciated.

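The problem was in how I defined the vertices. I renamed "station_ID" to "id", but the column that has to become "id" is "name": the src and dst columns of the edges hold station names, so the vertex ids must be station names as well. So this line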

verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
should be

verticesRDD = verticesRDD.withColumnRenamed("name", "id")
With this change, pageRank works correctly.
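
To see why the choice of id column matters, here is a minimal plain-Python sketch (not the GraphFrames API) that checks whether every edge endpoint matches some vertex id. The rows are illustrative stand-ins: the two station names are taken from the error message in the question, while the numeric ids and the helper name `dangling_endpoints` are assumptions made for this example.

```python
# Illustrative stand-in rows; only the two station names come from the
# error message in the question. The numeric ids are assumed.
vertices = [
    {"station_ID": 50, "name": "Harry Bridges Plaza (Ferry Building)"},
    {"station_ID": 70, "name": "San Francisco Caltrain (Townsend at 4th)"},
]
edges = [
    {"src": "Harry Bridges Plaza (Ferry Building)",
     "dst": "San Francisco Caltrain (Townsend at 4th)"},
]


def dangling_endpoints(vertices, edges, id_column):
    """Return the edge endpoints that match no vertex when
    id_column is used as the vertex id (hypothetical helper)."""
    ids = {v[id_column] for v in vertices}
    endpoints = {e["src"] for e in edges} | {e["dst"] for e in edges}
    return endpoints - ids


# Using the numeric station id: no station name matches an integer id.
missing_with_station_id = dangling_endpoints(vertices, edges, "station_ID")
# Using the station name: every endpoint is found.
missing_with_name = dangling_endpoints(vertices, edges, "name")

print(len(missing_with_station_id))
print(len(missing_with_name))
```

Running the same kind of check on the real DataFrames before building the GraphFrame would have shown that none of the src or dst values appear among the integer ids.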

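For anyone else trying to understand what resetProbability and maxIter control, the sketch below is a textbook power-iteration PageRank in plain Python. It is not GraphFrames' implementation, and its normalization may differ from what GraphFrames returns. The function name `pagerank`, its parameter names, and the three-station graph are assumptions made for illustration.

```python
def pagerank(edges, reset_probability=0.15, max_iter=10):
    """Textbook power-iteration PageRank over a list of (src, dst) pairs.

    Illustrative only; not the GraphFrames implementation.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Every node first receives the "teleport" share set by reset_probability.
        new_ranks = {node: reset_probability / n for node in nodes}
        for node in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[node] or nodes
            share = (1.0 - reset_probability) * ranks[node] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical trip graph between three stations: A->B, B->C, C->A, A->C.
ranks = pagerank([("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")])
for station in sorted(ranks, key=ranks.get, reverse=True):
    print(station, round(ranks[station], 4))
```

Here reset_probability is the chance of jumping to a random node instead of following an edge, and max_iter is the number of update rounds. Station C ends up ranked above B because it receives trips from both A and B.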