Implementing Louvain in pyspark using dataframes
I am trying to implement the Louvain algorithm in pyspark using dataframes. The problem is that my implementation is really slow. This is how I do it:

- I collect all vertices and community IDs into plain Python lists
- For each vertex-community ID pair I compute the modularity gain using dataframes (it is just a fancy formula involving sums/differences of edge weights)
- Repeat until there is no change

What am I doing wrong?

I suppose that if I could somehow parallelize the for-each loop the performance would improve, but how can I do that?

Later edit:

I could use vertices.foreach(changeCommunityId) instead of the for-each loop, but then I would have to compute the modularity gain (that fancy formula) without dataframes.
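For reference, the "fancy formula" in Louvain is the modularity gain for moving a vertex into a candidate community, which needs only a handful of scalar aggregates per move. A minimal pure-Python sketch (function and variable names are my own, following the usual statement of the formula) to make the inputs explicit:

```python
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Modularity gain for moving vertex i into community C.

    sigma_in  -- sum of the weights of the edges inside C
    sigma_tot -- sum of the weights of the edges incident to vertices in C
    k_i       -- sum of the weights of the edges incident to vertex i
    k_i_in    -- sum of the weights of the edges from i to vertices in C
    m         -- half of the total edge weight of the graph
    """
    after = (sigma_in + 2 * k_i_in) / (2 * m) - ((sigma_tot + k_i) / (2 * m)) ** 2
    before = sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2 - (k_i / (2 * m)) ** 2
    return after - before
```

Note that only `k_i_in` and `sigma_tot` depend on the candidate community, which is why the per-candidate work can be reduced to cheap aggregations rather than repeated joins.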
See the code sample below:
def louvain(self):
    oldModularity = 0  # since initially each node represents a community
    graph = self.graph
    # retrieve graph vertices and edges dataframes
    vertices = verticesDf = self.graph.vertices
    aij = edgesDf = self.graph.edges

    canOptimize = True
    allCommunityIds = [row['communityId'] for row in verticesDf.select('communityId').distinct().collect()]
    verticesIdsCommunityIds = [(row['id'], row['communityId']) for row in verticesDf.select('id', 'communityId').collect()]

    allEdgesSum = self.graph.edges.groupBy().sum('weight').collect()
    m = allEdgesSum[0]['sum(weight)'] / 2

    def computeModularityGain(vertexId, newCommunityId):
        # the sum of all weights of the edges within C
        sourceNodesNewCommunity = vertices.join(aij, vertices.id == aij.src) \
            .select('weight', 'src', 'communityId') \
            .where(vertices.communityId == newCommunityId)
        destinationNodesNewCommunity = vertices.join(aij, vertices.id == aij.dst) \
            .select('weight', 'dst', 'communityId') \
            .where(vertices.communityId == newCommunityId)

        k_in = sourceNodesNewCommunity \
            .join(destinationNodesNewCommunity,
                  sourceNodesNewCommunity.communityId == destinationNodesNewCommunity.communityId) \
            .count()

        # the rest of the formula computation goes here, I just wanted to show you an example
        # just return some value for the modularity
        return 0.9

    def changeCommunityId(vertexId, currentCommunityId):
        maxModularityGain = 0
        maxModularityGainCommunityId = None
        for newCommunityId in allCommunityIds:
            if newCommunityId != currentCommunityId:
                modularityGain = computeModularityGain(vertexId, newCommunityId)
                if modularityGain > maxModularityGain:
                    maxModularityGain = modularityGain
                    maxModularityGainCommunityId = newCommunityId
        if maxModularityGain > 0:
            return maxModularityGainCommunityId
        return currentCommunityId

    while canOptimize:
        while self.changeInModularity:
            self.changeInModularity = False
            for vertexCommunityIdPair in verticesIdsCommunityIds:
                vertexId = vertexCommunityIdPair[0]
                currentCommunityId = vertexCommunityIdPair[1]
                newCommunityId = changeCommunityId(vertexId, currentCommunityId)
                if newCommunityId != currentCommunityId:
                    self.changeInModularity = True
        canOptimize = False
Have you considered using GraphFrames instead of dataframes? There is a feature request for this here: and here is an implementation of an older version of Louvain modularity in Scala and Spark GraphX: