Apache Spark GraphFrames: merging edge vertices with similar column values


tl;dr: How do I simplify a graph by removing edge vertices that share the same name value?

I have a graph defined as follows:

import graphframes
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
vertices = spark.createDataFrame([
    ('1', 'foo', '1'),
    ('2', 'bar', '2'),
    ('3', 'bar', '3'),
    ('4', 'bar', '5'),
    ('5', 'baz', '9'),
    ('6', 'blah', '1'),
    ('7', 'blah', '2'),
    ('8', 'blah', '3')
], ['id', 'name', 'value'])

edges = spark.createDataFrame([
    ('1', '2'),
    ('1', '3'),
    ('1', '4'),
    ('1', '5'),
    ('5', '6'),
    ('5', '7'),
    ('5', '8')
], ['src', 'dst'])

f = graphframes.GraphFrame(vertices, edges)
This produces a graph that looks like the following, where the numbers represent the vertex IDs:
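The image from the original post is not reproduced here; as a stand-in, the structure implied by the edge list can be printed as an adjacency list with a few lines of plain Python:

```python
# The original figure is missing; reconstruct the graph's shape from the edges.
edges = [
    ('1', '2'), ('1', '3'), ('1', '4'), ('1', '5'),
    ('5', '6'), ('5', '7'), ('5', '8'),
]

adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)

print(adjacency)
# {'1': ['2', '3', '4', '5'], '5': ['6', '7', '8']}
```

So vertex 1 points at 2, 3, 4, and 5, and vertex 5 points at 6, 7, and 8.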

Starting from the vertex with ID 1, I want to simplify this graph so that vertices with the same name value are merged into a single vertex. The resulting graph would look something like this:

Note that we end up with only one foo (ID 1), one bar (ID 2), one baz (ID 5), and one blah (ID 6). The value of a vertex is irrelevant; it is only there to show that each vertex is unique.

I attempted to implement a solution, but it is hacky and horribly inefficient. I'm sure there is a better way, and I also don't think it works correctly:

f = graphframes.GraphFrame(vertices, edges)

# Get the out degrees for our nodes. Nodes that do not appear in
# this dataframe have zero out degrees.
outs = f.outDegrees

# Merge this with our nodes.
vertices = f.vertices
vertices = f.vertices.join(outs, outs.id == vertices.id, 'left').select(vertices.id, 'name', 'value', 'outDegree')
vertices.show()

# Create a new graph with our out degree nodes.
f = graphframes.GraphFrame(vertices, edges)

# Find paths to all edge vertices from our vertex ID = 1
# Can we make this one operation instead of two??? What if we have more than two hops?
one_hop = f.find('(a)-[e]->(b)').filter('b.outDegree is null').filter('a.id == "1"')
one_hop.show()

two_hop = f.find('(a)-[e1]->(b); (b)-[e2]->(c)').filter('c.outDegree is null').filter('a.id == "1"')
two_hop.show()

# Super ugly, but union the vertices from the `one_hop` and `two_hop` above, and unique
# on the name.
vertices = one_hop.select('a.*').union(one_hop.select('b.*'))
vertices = vertices.union(two_hop.select('a.*').union(two_hop.select('b.*').union(two_hop.select('c.*'))))
vertices = vertices.dropDuplicates(['name'])
vertices.show()

# Do the same for the edges
edges = two_hop.select('e1.*').union(two_hop.select('e2.*')).union(one_hop.select('e.*')).distinct()

# We need to ensure that we have the respective nodes from our edges. We do this by
# ensuring the referenced vertex ID is in our `vertices` in both the `src` and the `dst`
# columns - this does NOT seem to work as I'd expect!
edges = edges.join(vertices, vertices.id == edges.src, "left").select("src", "dst")
edges = edges.join(vertices, vertices.id == edges.dst, "left").select("src", "dst")
edges.show()
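As for making the one-hop and two-hop motifs a single operation: one way to avoid writing a separate `find()` per hop count is to generate the motif string programmatically. The helper below is a sketch of my own (`motif` and the `v0`/`e0` naming are not GraphFrames APIs, just the motif-string syntax GraphFrames accepts):

```python
# Sketch: build a GraphFrames motif string for a path of n_hops edges,
# so one loop can cover any path length instead of hand-writing each one.
def motif(n_hops):
    return '; '.join(
        '(v{i})-[e{i}]->(v{j})'.format(i=i, j=i + 1) for i in range(n_hops)
    )

print(motif(2))
# (v0)-[e0]->(v1); (v1)-[e1]->(v2)

# Hypothetical usage against the GraphFrame `f` above, for a given n:
#   paths = f.find(motif(n)) \
#            .filter('v0.id = "1"') \
#            .filter('v{}.outDegree is null'.format(n))
```

The union-and-deduplicate step would still be needed per hop count, but at least the pattern itself no longer has to be written out by hand.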

Is there an easier way to remove vertices and their corresponding edges so that the edge vertices are unique on their name?

Why not simply treat the name column as the new id?

import graphframes

vertices = spark.createDataFrame([
    ('1', 'foo', '1'),
    ('2', 'bar', '2'),
    ('3', 'bar', '3'),
    ('4', 'bar', '5'),
    ('5', 'baz', '9'),
    ('6', 'blah', '1'),
    ('7', 'blah', '2'),
    ('8', 'blah', '3')
], ['id', 'name', 'value'])

edges = spark.createDataFrame([
    ('1', '2'),
    ('1', '3'),
    ('1', '4'),
    ('1', '5'),
    ('5', '6'),
    ('5', '7'),
    ('5', '8')
], ['src', 'dst'])

# Create a dataframe with only one column
new_vertices = vertices.select(vertices.name.alias('id')).distinct()

# Replace the src ids with the name column
new_edges = edges.join(vertices, edges.src == vertices.id, 'left')
new_edges = new_edges.select(new_edges.dst, new_edges.name.alias('src'))

# Replace the dst ids with the name column
new_edges = new_edges.join(vertices, new_edges.dst == vertices.id, 'left')
new_edges = new_edges.select(new_edges.src, new_edges.name.alias('dst'))

# Drop duplicate edges
new_edges = new_edges.dropDuplicates(['src', 'dst'])

new_edges.show()
new_vertices.show()

f = graphframes.GraphFrame(new_vertices, new_edges)

Output:

+---+----+
|src| dst|
+---+----+
|foo| baz|
|foo| bar|
|baz|blah|
+---+----+

+----+
|  id|
+----+
|blah|
| bar|
| foo|
| baz|
+----+

Is there any rule for which vertex is chosen from each set, i.e. why 2, 3, 4 -> 2 and 6, 7, 8 -> 6? — There is no rule; the choice is arbitrary. The important point is that the vertex names are unique; the ids and values are largely irrelevant.
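If a deterministic choice is wanted instead of an arbitrary one, one option (my own convention, not part of the answer above) is to keep the smallest id per name; in PySpark that would be a `groupBy('name')` with `pyspark.sql.functions.min('id')`. A plain-Python sketch of the rule on the question's data:

```python
# Deterministic merge rule: for each name, keep the vertex with the smallest
# id, so 2, 3, 4 -> 2 and 6, 7, 8 -> 6 as in the question's desired output.
vertices = [
    ('1', 'foo', '1'), ('2', 'bar', '2'), ('3', 'bar', '3'),
    ('4', 'bar', '5'), ('5', 'baz', '9'), ('6', 'blah', '1'),
    ('7', 'blah', '2'), ('8', 'blah', '3'),
]
edges = [
    ('1', '2'), ('1', '3'), ('1', '4'), ('1', '5'),
    ('5', '6'), ('5', '7'), ('5', '8'),
]

# Representative id per name (smallest id wins)
rep = {}
for vid, name, _value in vertices:
    if name not in rep or vid < rep[name]:
        rep[name] = vid

# Rewrite each edge endpoint to its representative, then deduplicate
name_of = {vid: name for vid, name, _value in vertices}
new_edges = sorted({(rep[name_of[s]], rep[name_of[d]]) for s, d in edges})

print(sorted(rep.values()))  # ['1', '2', '5', '6']
print(new_edges)             # [('1', '2'), ('1', '5'), ('5', '6')]
```

This reproduces exactly the foo ID 1, bar ID 2, baz ID 5, and blah ID 6 vertices the question asked for, with reproducible results across runs.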