Loading a Spark RDD into Neo4j from Python
Tags: python, apache-spark, neo4j, cypher, pyspark

I am working on a project in which I use Spark for data processing. The data has now been processed and I need to load it into Neo4j; once it is there, I will use it to present the results. I would like the whole implementation to be in Python, but I have not been able to find any library or example online. Can you point me to some links, libraries, or examples?

My RDD is a paired RDD, and from each tuple I have to create relationships:
PairedRDD
Key Value
Jack [a,b,c]
For simplicity, I converted the RDD to:
Key Value
Jack a
Jack b
Jack c
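For reference, a minimal sketch of that flattening step in PySpark (userhobbies is my paired RDD, the same name used in my code below):

# Flatten ("Jack", ["a", "b", "c"]) into ("Jack", "a"), ("Jack", "b"), ("Jack", "c")
userhobbies_flat = userhobbies.flatMap(
    lambda kv: [(kv[0], value) for value in kv[1]]
)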
Then I have to create relationships between them:
Jack->a
Jack->b
Jack->c
Based on William's answer, I can load the lists directly, but this data raises a Cypher error. I tried the following:
# batch, processed and max are globals shared with the Spark foreach below
batch = []
max = None
processed = 0

def writeBatch(b):
    print("writing batch of " + str(len(b)))
    session = driver.session()
    session.run('UNWIND {batch} AS elt MERGE (n:user1 {user: elt[0]})', {'batch': b})
    session.close()

def write2neo(v):
    batch.append(v)
    for hobby in v[1]:
        batch.append([v[0], hobby])
    global processed
    processed += 1
    if len(batch) >= 500 or processed >= max:
        writeBatch(batch)
        batch[:] = []

max = userhobbies.count()
userhobbies.foreach(write2neo)
b is a list of lists. After the UNWIND, elt is a two-element list, with elt[0] and elt[1] holding the key and the value.
The error:
ValueError: Structure signature must be a single byte value
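My guess at the cause: the first batch.append(v) call puts a raw Python tuple into the batch, and the Bolt driver serializes tuples as Structures rather than lists, which would produce exactly this ValueError. A minimal sketch of a fix under that assumption, appending only list-typed [key, value] pairs:

def write2neo(v):
    # Append [key, value] lists only; raw tuples make the Bolt serializer
    # fail with "Structure signature must be a single byte value".
    for hobby in v[1]:
        batch.append([v[0], hobby])
    global processed
    processed += 1
    if len(batch) >= 500 or processed >= max:
        writeBatch(batch)
        batch[:] = []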
Thanks in advance.

You can do a foreach on the RDD, for example:
from neo4j.v1 import GraphDatabase, basic_auth
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("", ""), encrypted=False)

from pyspark import SparkContext
sc = SparkContext()
dt = sc.parallelize(range(1, 5))

def write2neo(v):
    session = driver.session()
    session.run("CREATE (n:Node {value: {v} })", {'v': v})
    session.close()

dt.foreach(write2neo)
I would improve the function to do batched writes, though; this simple snippet is only a basic implementation.

UPDATE with an example of batched writes:
sc = SparkContext()
batch = []
max = None
processed = 0

def writeBatch(b):
    print("writing batch of " + str(len(b)))
    session = driver.session()
    session.run('UNWIND {batch} AS elt CREATE (n:Node {v: elt})', {'batch': b})
    session.close()

def write2neo(v):
    batch.append(v)
    global processed
    processed += 1
    if len(batch) >= 500 or processed >= max:
        writeBatch(batch)
        batch[:] = []

dt = sc.parallelize(range(1, 2136))
max = dt.count()
dt.foreach(write2neo)
Which results in:
16/09/15 12:25:47 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
writing batch of 500
writing batch of 500
writing batch of 500
writing batch of 500
writing batch of 135
16/09/15 12:25:47 INFO PythonRunner: Times: total = 279, boot = -103, init = 245, finish = 137
16/09/15 12:25:47 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1301 bytes result sent to driver
16/09/15 12:25:47 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 294 ms on localhost (1/1)
16/09/15 12:25:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/09/15 12:25:47 INFO DAGScheduler: ResultStage 1 (foreach at /Users/ikwattro/dev/graphaware/untitled/writeback.py:36) finished in 0.295 s
16/09/15 12:25:47 INFO DAGScheduler: Job 1 finished: foreach at /Users/ikwattro/dev/graphaware/untitled/writeback.py:36, took 0.308263 s
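Note that the batching above relies on module-level globals, which behaves predictably only when everything runs in a single local Python worker. A sketch of an alternative that keeps the batch state local to each partition with Spark's foreachPartition (same Node label and batch size of 500 as above):

def write_partition(rows):
    # Create the driver inside the function so nothing has to be pickled
    # and shipped to the executors.
    from neo4j.v1 import GraphDatabase, basic_auth
    driver = GraphDatabase.driver("bolt://localhost",
                                  auth=basic_auth("", ""), encrypted=False)
    session = driver.session()
    batch = []
    for v in rows:
        batch.append(v)
        if len(batch) >= 500:
            session.run('UNWIND {batch} AS elt CREATE (n:Node {v: elt})',
                        {'batch': batch})
            batch = []
    if batch:
        # Flush the final partial batch.
        session.run('UNWIND {batch} AS elt CREATE (n:Node {v: elt})',
                    {'batch': batch})
    session.close()

dt.foreachPartition(write_partition)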
Comments:

- The answer looks good, but what about huge data sets? How would you implement the same thing with batching for a large data set, as you mentioned?
- I have updated my answer with an example of batched writes.
- I am getting an error while unwinding the key/value pairs: run('UNWIND {batch} AS elt MERGE (n:user1 {user: elt[0]})', {'batch': b}), where elt[0] is the key and the value holds a list of values. How can we fit all of that into the Cypher?
- Chris, I got it working. Thanks for your help. Cheers.
- Hello, how did you convert your Key/Value "Jack [a,b,c]" into "Jack a / Jack b / Jack c"?
- @A.HADDAD with the unwind operator?
- Is unwind a Spark or Python operator? I would like to do the same with my Java pair RDD.
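For completeness, one way to express the relationship pattern asked about in the comments, assuming each batch element is a [key, value] list as above (the hobby1 label and HAS_HOBBY relationship type are made up for illustration):

def writeBatch(b):
    # Each elt is a [key, value] list: MERGE both nodes and the
    # relationship between them in one UNWIND pass.
    session = driver.session()
    session.run('UNWIND {batch} AS elt '
                'MERGE (u:user1 {user: elt[0]}) '
                'MERGE (h:hobby1 {name: elt[1]}) '
                'MERGE (u)-[:HAS_HOBBY]->(h)',
                {'batch': b})
    session.close()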