Implementing Louvain in PySpark using DataFrames


I am trying to implement Louvain in PySpark using DataFrames. The problem is that my implementation is really slow. This is how I do it:

  • I collect all vertices and community IDs into plain Python lists
  • For each vertex-community ID pair, I compute the modularity gain using DataFrames (just a fancy formula involving sums/differences of edge weights)
  • Repeat until there is no change

What am I doing wrong?

I think that if I could somehow parallelize the for-each loop the performance would improve, but how can I do that?
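
One direction, sketched very roughly below, would be to push the per-vertex work down into a single DataFrame job: compute the edge-weight sum from every vertex into every community at once, instead of running a pair of joins per vertex-community pair from the driver. The column names follow the code further down; the helper name is made up for illustration:

    from pyspark.sql import functions as F

    def community_edge_sums(vertices, edges):
        # tag every edge with the community of its destination vertex
        dst_comm = edges.join(vertices, edges.dst == vertices.id) \
                        .select(edges.src.alias('src'),
                                vertices.communityId.alias('dstCommunityId'),
                                edges.weight.alias('weight'))
        # sum of edge weights from each vertex into each community: the k_i,in
        # term of the gain formula, for all (vertex, community) pairs in one job
        return dst_comm.groupBy('src', 'dstCommunityId') \
                       .agg(F.sum('weight').alias('k_i_in'))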

Later edit: I could use vertices.foreach(changeCommunityId) instead of the for-each loop, but then I would have to compute the modularity gain (that fancy formula) without DataFrames.
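
For reference, the foreach variant mentioned above would look roughly like the sketch below (hypothetical names). The catch is that the callback runs on the executors, where the SparkSession and DataFrames such as vertices and aij are not available, which is exactly the limitation described:

    def change_community_of_row(row):
        vertex_id = row['id']
        current_community_id = row['communityId']
        # only plain Python is usable here; the modularity gain would have to be
        # computed from broadcast data structures rather than DataFrame joins
        ...

    verticesDf.foreach(change_community_of_row)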

See the code sample below:

    def louvain(self):
    
            oldModularity = 0 # since initially each node represents a community
    
            graph = self.graph
    
            # retrieve graph vertices and edges dataframes
            vertices = verticesDf = self.graph.vertices
            aij = edgesDf = self.graph.edges
    
            canOptimize = True
    
            allCommunityIds = [row['communityId'] for row in verticesDf.select('communityId').distinct().collect()]
            verticesIdsCommunityIds = [(row['id'], row['communityId']) for row in verticesDf.select('id', 'communityId').collect()]
    
            allEdgesSum = self.graph.edges.groupBy().sum('weight').collect()
            m = allEdgesSum[0]['sum(weight)']/2
    
            def computeModularityGain(vertexId, newCommunityId):
    
                # the sum of all weights of the edges within C
                sourceNodesNewCommunity = vertices.join(aij, vertices.id == aij.src) \
                                    .select('weight', 'src', 'communityId') \
                                    .where(vertices.communityId == newCommunityId);
                destinationNodesNewCommunity = vertices.join(aij, vertices.id == aij.dst) \
                                    .select('weight', 'dst', 'communityId') \
                                    .where(vertices.communityId == newCommunityId);
    
                k_in = sourceNodesNewCommunity.join(destinationNodesNewCommunity, sourceNodesNewCommunity.communityId == destinationNodesNewCommunity.communityId) \
                            .count()
                # the rest of the formula computation goes here, I just wanted to show you an example
                # just return some value for the modularity
                return 0.9  
    
            def changeCommunityId(vertexId, currentCommunityId):
    
                maxModularityGain = 0
                maxModularityGainCommunityId = None
                for newCommunityId in allCommunityIds:
                    if (newCommunityId != currentCommunityId):
                        modularityGain = computeModularityGain(vertexId, newCommunityId)
                        if (modularityGain > maxModularityGain):
                            maxModularityGain = modularityGain
                            maxModularityGainCommunityId = newCommunityId
    
                if (maxModularityGain > 0):
                    return maxModularityGainCommunityId
                return currentCommunityId
    
            while canOptimize:
    
                while self.changeInModularity:
    
                    self.changeInModularity = False
    
                    for vertexCommunityIdPair in verticesIdsCommunityIds:
                        vertexId = vertexCommunityIdPair[0]
                        currentCommunityId = vertexCommunityIdPair[1]
                        newCommunityId = changeCommunityId(vertexId, currentCommunityId)
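                        # (the returned newCommunityId would be written back to the vertex's
                        # communityId, and self.changeInModularity set when it differs; omitted here)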
    
                    self.changeInModularity = False
    
                canOptimize = False
    

Have you considered using GraphFrames instead of plain DataFrames? There is a feature request for this here: An older implementation of Louvain modularity in Scala and Spark GraphX is here:
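
In case it helps, here is a minimal sketch of what the GraphFrames route could look like, assuming the graphframes package is available and that verticesDf has an id column while edgesDf has src/dst columns. Louvain itself is not built into GraphFrames (that is what the feature request is about), but label propagation is available as a community-detection alternative:

    from graphframes import GraphFrame

    # vertices need an "id" column; edges need "src" and "dst" columns
    g = GraphFrame(verticesDf, edgesDf)

    # GraphFrames does not ship Louvain, but label propagation is built in
    # and runs as DataFrame/GraphX jobs rather than a driver-side loop
    communities = g.labelPropagation(maxIter=5)
    communities.select('id', 'label').show()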