Apache Spark GraphFrames: merging edge vertices with similar column values


tl;dr: How do I simplify a graph by removing edge vertices that share the same name value?

I have a graph defined as follows:

import graphframes
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
vertices = spark.createDataFrame([
    ('1', 'foo', '1'),
    ('2', 'bar', '2'),
    ('3', 'bar', '3'),
    ('4', 'bar', '5'),
    ('5', 'baz', '9'),
    ('6', 'blah', '1'),
    ('7', 'blah', '2'),
    ('8', 'blah', '3')
], ['id', 'name', 'value'])

edges = spark.createDataFrame([
    ('1', '2'),
    ('1', '3'),
    ('1', '4'),
    ('1', '5'),
    ('5', '6'),
    ('5', '7'),
    ('5', '8')
], ['src', 'dst'])

f = graphframes.GraphFrame(vertices, edges)
This produces a graph that looks like the following, where the numbers represent the vertex IDs:
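The image from the original post is not reproduced here; as a stand-in, the structure implied by the edge list can be printed as an adjacency list with a few lines of plain Python:

```python
# The original figure is missing; reconstruct the graph's shape from the edges.
edges = [
    ('1', '2'), ('1', '3'), ('1', '4'), ('1', '5'),
    ('5', '6'), ('5', '7'), ('5', '8'),
]

adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)

print(adjacency)
# {'1': ['2', '3', '4', '5'], '5': ['6', '7', '8']}
```

So vertex 1 points at 2, 3, 4, and 5, and vertex 5 points at 6, 7, and 8.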

Starting from the vertex with ID 1, I want to simplify this graph so that vertices with the same name value are merged into a single vertex. The resulting graph would look something like this:

Note that we end up with only one foo (ID 1), one bar (ID 2), one baz (ID 5), and one blah (ID 6). The value of a vertex is irrelevant; it is only there to show that each vertex is unique.

I attempted to implement a solution, but it is hacky and horribly inefficient. I'm sure there is a better way, and I also don't think it works correctly:

f = graphframes.GraphFrame(vertices, edges)

# Get the out degrees for our nodes. Nodes that do not appear in
# this dataframe have zero out degrees.
outs = f.outDegrees

# Merge this with our nodes.
vertices = f.vertices
vertices = f.vertices.join(outs, outs.id == vertices.id, 'left').select(vertices.id, 'name', 'value', 'outDegree')
vertices.show()

# Create a new graph with our out degree nodes.
f = graphframes.GraphFrame(vertices, edges)

# Find paths to all edge vertices from our vertex ID = 1
# Can we make this one operation instead of two??? What if we have more than two hops?
one_hop = f.find('(a)-[e]->(b)').filter('b.outDegree is null').filter('a.id == "1"')
one_hop.show()

two_hop = f.find('(a)-[e1]->(b); (b)-[e2]->(c)').filter('c.outDegree is null').filter('a.id == "1"')
two_hop.show()

# Super ugly, but union the vertices from the `one_hop` and `two_hop` above, and unique
# on the name.
vertices = one_hop.select('a.*').union(one_hop.select('b.*'))
vertices = vertices.union(two_hop.select('a.*').union(two_hop.select('b.*').union(two_hop.select('c.*'))))
vertices = vertices.dropDuplicates(['name'])
vertices.show()

# Do the same for the edges
edges = two_hop.select('e1.*').union(two_hop.select('e2.*')).union(one_hop.select('e.*')).distinct()

# We need to ensure that we have the respective nodes from our edges. We do this by
# ensuring the referenced vertex ID is in our `vertices` in both the `src` and the `dst`
# columns - this does NOT seem to work as I'd expect!
edges = edges.join(vertices, vertices.id == edges.src, "left").select("src", "dst")
edges = edges.join(vertices, vertices.id == edges.dst, "left").select("src", "dst")
edges.show()
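As for making the one-hop and two-hop motifs a single operation: one way to avoid writing a separate `find()` per hop count is to generate the motif string programmatically. The helper below is a sketch of my own (`motif` and the `v0`/`e0` naming are not GraphFrames APIs, just the motif-string syntax GraphFrames accepts):

```python
# Sketch: build a GraphFrames motif string for a path of n_hops edges,
# so one loop can cover any path length instead of hand-writing each one.
def motif(n_hops):
    return '; '.join(
        '(v{i})-[e{i}]->(v{j})'.format(i=i, j=i + 1) for i in range(n_hops)
    )

print(motif(2))
# (v0)-[e0]->(v1); (v1)-[e1]->(v2)

# Hypothetical usage against the GraphFrame `f` above, for a given n:
#   paths = f.find(motif(n)) \
#            .filter('v0.id = "1"') \
#            .filter('v{}.outDegree is null'.format(n))
```

The union-and-deduplicate step would still be needed per hop count, but at least the pattern itself no longer has to be written out by hand.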

Is there an easier way to remove vertices and their corresponding edges so that the edge vertices are unique on their name?

Why not simply treat the name column as the new id?

import graphframes

vertices = spark.createDataFrame([
    ('1', 'foo', '1'),
    ('2', 'bar', '2'),
    ('3', 'bar', '3'),
    ('4', 'bar', '5'),
    ('5', 'baz', '9'),
    ('6', 'blah', '1'),
    ('7', 'blah', '2'),
    ('8', 'blah', '3')
], ['id', 'name', 'value'])

edges = spark.createDataFrame([
    ('1', '2'),
    ('1', '3'),
    ('1', '4'),
    ('1', '5'),
    ('5', '6'),
    ('5', '7'),
    ('5', '8')
], ['src', 'dst'])

# Create a dataframe with only one column
new_vertices = vertices.select(vertices.name.alias('id')).distinct()

# Replace the src ids with the name column
new_edges = edges.join(vertices, edges.src == vertices.id, 'left')
new_edges = new_edges.select(new_edges.dst, new_edges.name.alias('src'))

# Replace the dst ids with the name column
new_edges = new_edges.join(vertices, new_edges.dst == vertices.id, 'left')
new_edges = new_edges.select(new_edges.src, new_edges.name.alias('dst'))

# Drop duplicate edges
new_edges = new_edges.dropDuplicates(['src', 'dst'])

new_edges.show()
new_vertices.show()

f = graphframes.GraphFrame(new_vertices, new_edges)

Output:

+---+----+
|src| dst|
+---+----+
|foo| baz|
|foo| bar|
|baz|blah|
+---+----+

+----+
|  id|
+----+
|blah|
| bar|
| foo|
| baz|
+----+

Is there any rule for which vertex is chosen from each set, i.e. why 2, 3, 4 -> 2 and 6, 7, 8 -> 6? — There is no rule; the choice is arbitrary. The important point is that the vertex names are unique; the ids and values are largely irrelevant.
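If a deterministic choice is wanted instead of an arbitrary one, one option (my own convention, not part of the answer above) is to keep the smallest id per name; in PySpark that would be a `groupBy('name')` with `pyspark.sql.functions.min('id')`. A plain-Python sketch of the rule on the question's data:

```python
# Deterministic merge rule: for each name, keep the vertex with the smallest
# id, so 2, 3, 4 -> 2 and 6, 7, 8 -> 6 as in the question's desired output.
vertices = [
    ('1', 'foo', '1'), ('2', 'bar', '2'), ('3', 'bar', '3'),
    ('4', 'bar', '5'), ('5', 'baz', '9'), ('6', 'blah', '1'),
    ('7', 'blah', '2'), ('8', 'blah', '3'),
]
edges = [
    ('1', '2'), ('1', '3'), ('1', '4'), ('1', '5'),
    ('5', '6'), ('5', '7'), ('5', '8'),
]

# Representative id per name (smallest id wins)
rep = {}
for vid, name, _value in vertices:
    if name not in rep or vid < rep[name]:
        rep[name] = vid

# Rewrite each edge endpoint to its representative, then deduplicate
name_of = {vid: name for vid, name, _value in vertices}
new_edges = sorted({(rep[name_of[s]], rep[name_of[d]]) for s, d in edges})

print(sorted(rep.values()))  # ['1', '2', '5', '6']
print(new_edges)             # [('1', '2'), ('1', '5'), ('5', '6')]
```

This reproduces exactly the foo ID 1, bar ID 2, baz ID 5, and blah ID 6 vertices the question asked for, with reproducible results across runs.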