Error in GraphFrames PageRank with PySpark


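I am new to PySpark and am trying to understand how PageRank works. I am using Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (and their schemas) were posted at two links that are not preserved here.

My code is as follows: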

#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *

#Read the csv files 
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")

#Renaming the id columns to enable GraphFrame 
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")

#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether I register the RDDs as temp tables or not, I get the same results, so I'm not sure this step is really needed

#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)

Now, when I run the pageRank function:

g.pageRank(resetProbability=0.15, maxIter=10)

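Py4JJavaError: An error occurred while calling o98.run.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 2637, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)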

Py4JJavaError: An error occurred while calling o166.run.: org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID id not contained in GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Terminal: int, End Terminal: int, Bike #: int, Subscriber Type: string, Zip Code: string])

AttributeError: 'function' object has no attribute 'resetProbability'

ranks = g.pageRank(resetProbability=0.15, maxIter=10).run()

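Py4JJavaError: An error occurred while calling o188.run.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 2641, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

I have been reading the documentation but cannot figure out what is going wrong. Any help would be greatly appreciated.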

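The problem was in how I defined the vertices. I renamed "station_ID" to "id", when it actually had to be "name": the src and dst columns of the edges hold station names (strings), so the vertex id must hold the station name as well, not the integer station ID. So this line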

verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")

should be

verticesRDD = verticesRDD.withColumnRenamed("name", "id")

With this change, pageRank works correctly.
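
One way to catch this kind of mismatch before building the graph is to check that every src and dst value in the edges also appears among the vertex ids. The sketch below does that in plain Python on a few hand-written rows. The sample rows and the helper name missing_endpoints are illustrative assumptions, not part of the original data or of the GraphFrames API.

```python
def missing_endpoints(vertex_ids, edges):
    """Return the set of edge endpoints that do not appear among the vertex ids."""
    known = set(vertex_ids)
    return {end for src, dst in edges for end in (src, dst) if end not in known}


# Hypothetical sample rows, modelled on the columns shown in the question.
stations = [
    {"station_id": 50, "name": "Harry Bridges Plaza (Ferry Building)"},
    {"station_id": 70, "name": "San Francisco Caltrain (Townsend at 4th)"},
]

# Edges refer to stations by name (a string), like "Start Station" / "End Station".
trips = [
    ("Harry Bridges Plaza (Ferry Building)",
     "San Francisco Caltrain (Townsend at 4th)"),
]

# station_id as the vertex id: no edge endpoint matches any vertex.
by_station_id = missing_endpoints([s["station_id"] for s in stations], trips)

# name as the vertex id: every endpoint is found.
by_name = missing_endpoints([s["name"] for s in stations], trips)

print(len(by_station_id), len(by_name))
```

With station_id as the id, both endpoints are reported missing, which is consistent with the null source and destination vertices in the scala.MatchError from the question. With name as the id, nothing is missing.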
