Implementing Louvain in PySpark using DataFrames


I am trying to implement Louvain in PySpark using DataFrames. The problem is that my implementation is really slow. This is how I do it:

  • I collect all vertices and community IDs into plain Python lists
  • For each vertex-community ID pair, I compute the modularity gain using DataFrames (just a fancy formula involving sums/differences of edge weights)
  • Repeat until there is no change

What am I doing wrong?

I think that if I could somehow parallelize the for-each loop the performance would improve, but how can I do that?
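
One direction, sketched very roughly below, would be to push the per-vertex work down into a single DataFrame job: compute the edge-weight sum from every vertex into every community at once, instead of running a pair of joins per vertex-community pair from the driver. The column names follow the code further down; the helper name is made up for illustration:

    from pyspark.sql import functions as F

    def community_edge_sums(vertices, edges):
        # tag every edge with the community of its destination vertex
        dst_comm = edges.join(vertices, edges.dst == vertices.id) \
                        .select(edges.src.alias('src'),
                                vertices.communityId.alias('dstCommunityId'),
                                edges.weight.alias('weight'))
        # sum of edge weights from each vertex into each community: the k_i,in
        # term of the gain formula, for all (vertex, community) pairs in one job
        return dst_comm.groupBy('src', 'dstCommunityId') \
                       .agg(F.sum('weight').alias('k_i_in'))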

Later edit: I could use vertices.foreach(changeCommunityId) instead of the for-each loop, but then I would have to compute the modularity gain (that fancy formula) without DataFrames.
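
For reference, the foreach variant mentioned above would look roughly like the sketch below (hypothetical names). The catch is that the callback runs on the executors, where the SparkSession and DataFrames such as vertices and aij are not available, which is exactly the limitation described:

    def change_community_of_row(row):
        vertex_id = row['id']
        current_community_id = row['communityId']
        # only plain Python is usable here; the modularity gain would have to be
        # computed from broadcast data structures rather than DataFrame joins
        ...

    verticesDf.foreach(change_community_of_row)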

See the code sample below:

    def louvain(self):
    
            oldModularity = 0 # since initially each node represents a community
    
            graph = self.graph
    
            # retrieve graph vertices and edges dataframes
            vertices = verticesDf = self.graph.vertices
            aij = edgesDf = self.graph.edges
    
            canOptimize = True
    
            allCommunityIds = [row['communityId'] for row in verticesDf.select('communityId').distinct().collect()]
            verticesIdsCommunityIds = [(row['id'], row['communityId']) for row in verticesDf.select('id', 'communityId').collect()]
    
            allEdgesSum = self.graph.edges.groupBy().sum('weight').collect()
            m = allEdgesSum[0]['sum(weight)']/2
    
            def computeModularityGain(vertexId, newCommunityId):
    
                # the sum of all weights of the edges within C
                sourceNodesNewCommunity = vertices.join(aij, vertices.id == aij.src) \
                                    .select('weight', 'src', 'communityId') \
                                    .where(vertices.communityId == newCommunityId);
                destinationNodesNewCommunity = vertices.join(aij, vertices.id == aij.dst) \
                                    .select('weight', 'dst', 'communityId') \
                                    .where(vertices.communityId == newCommunityId);
    
                k_in = sourceNodesNewCommunity.join(destinationNodesNewCommunity, sourceNodesNewCommunity.communityId == destinationNodesNewCommunity.communityId) \
                            .count()
                # the rest of the formula computation goes here, I just wanted to show you an example
                # just return some value for the modularity
                return 0.9  
    
            def changeCommunityId(vertexId, currentCommunityId):
    
                maxModularityGain = 0
                maxModularityGainCommunityId = None
                for newCommunityId in allCommunityIds:
                    if (newCommunityId != currentCommunityId):
                        modularityGain = computeModularityGain(vertexId, newCommunityId)
                        if (modularityGain > maxModularityGain):
                            maxModularityGain = modularityGain
                            maxModularityGainCommunityId = newCommunityId
    
                if (maxModularityGain > 0):
                    return maxModularityGainCommunityId
                return currentCommunityId
    
            while canOptimize:
    
                while self.changeInModularity:
    
                    self.changeInModularity = False
    
                    for vertexCommunityIdPair in verticesIdsCommunityIds:
                        vertexId = vertexCommunityIdPair[0]
                        currentCommunityId = vertexCommunityIdPair[1]
                        newCommunityId = changeCommunityId(vertexId, currentCommunityId)
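                        # (the returned newCommunityId would be written back to the vertex's
                        # communityId, and self.changeInModularity set when it differs; omitted here)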
    
                    self.changeInModularity = False
    
                canOptimize = False
    

Have you considered using GraphFrames instead of plain DataFrames? There is a feature request for this here: An older implementation of Louvain modularity in Scala and Spark GraphX is here:
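
In case it helps, here is a minimal sketch of what the GraphFrames route could look like, assuming the graphframes package is available and that verticesDf has an id column while edgesDf has src/dst columns. Louvain itself is not built into GraphFrames (that is what the feature request is about), but label propagation is available as a community-detection alternative:

    from graphframes import GraphFrame

    # vertices need an "id" column; edges need "src" and "dst" columns
    g = GraphFrame(verticesDf, edgesDf)

    # GraphFrames does not ship Louvain, but label propagation is built in
    # and runs as DataFrame/GraphX jobs rather than a driver-side loop
    communities = g.labelPropagation(maxIter=5)
    communities.select('id', 'label').show()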