Apache Spark / pyspark: finding the connected components of a large graph
I am trying to use connectedComponents() from graphframes in pyspark to compute the connected components of a fairly large graph with roughly 1,800K vertices and 500K edges.
edgeDF.printSchema()
root
|-- src: string (nullable = true)
|-- dst: string (nullable = true)
vertDF.printSchema()
root
|-- id: string (nullable = true)
vertDF.count()
1879806
edgeDF.count()
452196
custGraph = gf.GraphFrame(vertDF, edgeDF)
comp = custGraph.connectedComponents()
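One thing worth checking before anything else: in GraphFrames 0.3 and later, the default connectedComponents() implementation checkpoints intermediate results, so Spark needs a checkpoint directory set, and the older GraphX-based variant can be selected via the algorithm parameter. A minimal configuration sketch, assuming GraphFrames >= 0.3 (the checkpoint path and app name below are placeholders, and the toy DataFrames merely stand in for vertDF/edgeDF):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("cc-sketch").getOrCreate()

# Required by the default (DataFrame-based) connected-components
# implementation, which checkpoints intermediate results.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path

# Stand-ins for vertDF / edgeDF with the same schema as in the question.
vertDF = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edgeDF = spark.createDataFrame([("a", "b")], ["src", "dst"])

g = GraphFrame(vertDF, edgeDF)
comp = g.connectedComponents()                        # default algorithm
# comp = g.connectedComponents(algorithm="graphx")    # older GraphX variant
```

If the job still makes no progress, the GraphX variant is a reasonable A/B test, since the two implementations have quite different scaling behavior.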
Even after 6 hours, the job has not finished. I am running pyspark on a single Windows machine.
a. Is such a computation feasible with this setup?
b. I get warning messages like the following:
[rdd_73_2, rdd_90_2]
[Stage 21:=========> (2 + 2) / 4][Stage 22:> (0 + 2) / 4]16/10/13 01:28:42 WARN Executor: 2 block locks were not released by TID = 632:
[rdd_73_0, rdd_90_0]
[Stage 21:=============> (3 + 1) / 4][Stage 22:> (0 + 3) / 4]16/10/13 01:28:43 WARN Executor: 2 block locks were not released by TID = 633:
[rdd_73_1, rdd_90_1]
[Stage 37:> (0 + 4) / 4][Stage 38:> (0 + 0) / 4]16/10/13 01:28:47 WARN Executor: 3 block locks were not released by TID = 844:
[rdd_90_0, rdd_104_0, rdd_107_0]
What do these warnings mean?
c. How do I specify in GraphFrames that the graph is undirected? Do we need to add each edge in both directions?

Doesn't connectedComponents automatically treat the graph as undirected? I don't think you need to worry about (c). Regarding (b), you may want to follow this issue on the GraphFrames tracker:
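To illustrate the point about (c): connected components treats every edge as undirected, so a single directed edge src -> dst already places both endpoints in the same component, and adding reversed edges changes nothing. A minimal pure-Python union-find sketch over a toy edge list (this is only an illustration of the semantics, not GraphFrames' distributed implementation):

```python
def connected_components(vertices, edges):
    """Union-find over an edge list; edges are treated as undirected."""
    parent = {v: v for v in vertices}

    def find(v):
        # Root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        # Union the two endpoints regardless of edge direction.
        parent[find(src)] = find(dst)

    # Group vertices by their root representative.
    comps = {}
    for v in vertices:
        comps.setdefault(find(v), set()).add(v)
    return sorted(map(frozenset, comps.values()), key=min)

vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c")]                  # directed one way only
reversed_too = edges + [(d, s) for s, d in edges]

print(connected_components(vertices, edges))
# The partition is identical whether or not reversed edges are added:
print(connected_components(vertices, edges) ==
      connected_components(vertices, reversed_too))  # → True
```

So reversing edges before calling connectedComponents() only doubles the edge count without changing the result.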