
Loading a Spark RDD into Neo4j in Python

Tags: python, apache-spark, neo4j, cypher, pyspark

I'm working on a project in which I use Spark for data processing. The data has now been processed, and I need to load it into Neo4j. Once it is in Neo4j, I will use it to present the results.

I would like the whole implementation to be in Python, but I couldn't find any library or example online. Could you point me to some links, libraries, or examples?

My RDD is a paired RDD, and for each tuple I have to create a relationship.
PairedRDD

Key   Value
Jack  [a,b,c]
For simplicity, I transformed the RDD into:

 Key   Value
 Jack  a
 Jack  b
 Jack  c
Then I have to create relationships between them:

 Jack->a    
 Jack->b
 Jack->c
Based on William's answer, I can load the list directly, but with this data it raises a Cypher error.

This is what I tried:

batch_d = []
processed = 0

def writeBatch(b):
    print("writing batch of " + str(len(b)))
    session = driver.session()
    session.run('UNWIND {batch} AS elt MERGE (n:user1 {user: elt[0]})', {'batch': b})
    session.close()

def write2neo(v):
    batch_d.append(v)  # appends the raw (key, [values]) tuple
    for hobby in v[1]:
        batch_d.append([v[0], hobby])

    global processed
    processed += 1
    if len(batch_d) >= 500 or processed >= max:
        writeBatch(batch_d)
        batch_d[:] = []

max = userhobbies.count()
userhobbies.foreach(write2neo)
Here b is a list of lists. The unwound elt is a two-element list, with elt[0] and elt[1] as key and value.

The error:

ValueError: Structure signature must be a single byte value

Thanks in advance.

You can do a foreach on the RDD, e.g.:

from neo4j.v1 import GraphDatabase, basic_auth  # neo4j Python driver 1.x
driver = GraphDatabase.driver("bolt://localhost", auth=basic_auth("", ""), encrypted=False)
from pyspark import SparkContext

sc = SparkContext()
dt = sc.parallelize(range(1, 5))

def write2neo(v):
    # one session and one CREATE per element; fine for small RDDs
    session = driver.session()
    session.run("CREATE (n:Node {value: {v} })", {'v': v})
    session.close()

dt.foreach(write2neo)
I would, however, improve the function to batch the writes, but this simple snippet works for a basic implementation.

UPDATE with an example of batched writes:

sc = SparkContext()
batch = []
max = None  # set after the RDD is defined
processed = 0

def writeBatch(b):
    print("writing batch of " + str(len(b)))
    session = driver.session()
    session.run('UNWIND {batch} AS elt CREATE (n:Node {v: elt})', {'batch': b})
    session.close()

def write2neo(v):
    batch.append(v)
    global processed
    processed += 1
    # flush every 500 elements, and once more when the last element is processed
    if len(batch) >= 500 or processed >= max:
        writeBatch(batch)
        batch[:] = []

dt = sc.parallelize(range(1, 2136))
max = dt.count()
dt.foreach(write2neo)
Which results in:

16/09/15 12:25:47 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
writing batch of 500
writing batch of 500
writing batch of 500
writing batch of 500
writing batch of 135
16/09/15 12:25:47 INFO PythonRunner: Times: total = 279, boot = -103, init = 245, finish = 137
16/09/15 12:25:47 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1301 bytes result sent to driver
16/09/15 12:25:47 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 294 ms on localhost (1/1)
16/09/15 12:25:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/09/15 12:25:47 INFO DAGScheduler: ResultStage 1 (foreach at /Users/ikwattro/dev/graphaware/untitled/writeback.py:36) finished in 0.295 s
16/09/15 12:25:47 INFO DAGScheduler: Job 1 finished: foreach at /Users/ikwattro/dev/graphaware/untitled/writeback.py:36, took 0.308263 s
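For very large datasets, a variant sketch (not part of the answer above) is to batch per partition with foreachPartition, so each executor opens one session and no module-level globals are needed:

def write_partition(rows):
    # one session per partition; flush in chunks of 500
    session = driver.session()
    batch = []
    for v in rows:
        batch.append(v)
        if len(batch) >= 500:
            session.run('UNWIND {batch} AS elt CREATE (n:Node {v: elt})', {'batch': batch})
            batch = []
    if batch:  # flush the remainder
        session.run('UNWIND {batch} AS elt CREATE (n:Node {v: elt})', {'batch': batch})
    session.close()

dt.foreachPartition(write_partition)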

Comments:

- The answer looks good, but what about huge datasets? How can the same be done in batches for a large dataset, as you mentioned?
- I updated my answer with an example of batched writes.
- While unwinding the key-value pairs I got an error with run('UNWIND {batch} AS elt MERGE (n:user1 {user: elt[0]})', {'batch': b}), where elt[0] is the key and the value is a list. How can we get all of these into Cypher? (One possible approach is sketched below.)
- Chris, I got it working. Thanks for your help. Cheers.
- Hello, how did you transform your key/value Jack [a,b,c] into Jack a, Jack b, Jack c? @A.HADDAD using the unwind operator? Is unwind a Spark or Python operator? I want to do the same with my Java pair RDD.
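On the key-value question in the comments, here is a minimal sketch of one way to write the pairs as relationships, assuming the driver and userhobbies RDD from above (the User/Hobby labels and the LIKES relationship type are placeholders, not from the thread). The batch entries are built as plain lists rather than tuples, since the 1.x Bolt driver serializes Python tuples as Bolt structures, which is a likely cause of the "Structure signature must be a single byte value" error:

batch = []
processed = 0
max = None

def writeBatch(b):
    # b is a list of [key, value] lists; plain lists, not tuples,
    # because the 1.x Bolt driver cannot serialize Python tuples
    session = driver.session()
    session.run('UNWIND {batch} AS elt '
                'MERGE (u:User {name: elt[0]}) '
                'MERGE (h:Hobby {name: elt[1]}) '
                'MERGE (u)-[:LIKES]->(h)',
                {'batch': b})
    session.close()

def write2neo(v):
    # v is a ("Jack", ["a", "b", "c"]) pair: emit one [key, value] list per value
    for hobby in v[1]:
        batch.append([v[0], hobby])
    global processed
    processed += 1
    if len(batch) >= 500 or processed >= max:
        writeBatch(batch)
        batch[:] = []

max = userhobbies.count()
userhobbies.foreach(write2neo)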