Connecting PySpark to Neo4j with a CSV file

I want to feed data into Neo4j using PySpark. I tried the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

df = spark.read.csv("countries.csv")

df.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("ErrorIfExists") \
    .option("url", "bolt://localhost:7687") \
    .option("labels", ":Countries") \
    .save()
But it gives me this error:
2020-11-11 17:49:19 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2020-11-11 17:49:20 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "xxx", line 13, in <module>
.option("labels", ":Countries") \
File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 734, in save
self._jwrite.save()
File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o40.save.
: java.lang.ClassNotFoundException: Failed to find data source: org.neo4j.spark.DataSource. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
Can anyone help me get past this issue? Thanks in advance.

The problem is with your format option: the package has to be included in your Spark application. Try the following.
You have to add the Neo4j connector JAR and pass it when launching the PySpark shell:
pyspark --jars neo4j-connector-apache-spark_2.12-4.0.1_for_spark_3.jar
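As a sketch of the two usual ways to supply the connector (the Maven coordinates below are an assumption inferred from the JAR name; verify them against the Neo4j Spark Connector release page):

```shell
# Pass a locally downloaded connector JAR to the PySpark shell
pyspark --jars /path/to/neo4j-connector-apache-spark_2.12-4.0.1_for_spark_3.jar

# Or let Spark resolve it from Maven Central
# (coordinates are an assumption -- check the exact version string)
pyspark --packages org.neo4j:neo4j-connector-apache-spark_2.12:4.0.1_for_spark_3

# The same flags work with spark-submit
spark-submit --jars /path/to/neo4j-connector-apache-spark_2.12-4.0.1_for_spark_3.jar your_script.py
```

With --packages, Spark downloads the connector and its transitive dependencies automatically, which avoids managing the JAR file by hand.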
You can download this JAR from ( ); I have already included that package. Download the JAR file from the website ( ) and pass it with

--jars /path/to/jar

when running the pyspark or spark-submit command.

I downloaded the JAR file from the given link and passed it with the spark-submit command, but it still gives me the same error.
pyspark --jars neo4j-connector-apache-spark_2.12-4.0.1_for_spark_3.jar
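Another option is to declare the dependency inside the script itself via the `spark.jars.packages` configuration key, so no CLI flag is needed. This is a sketch, not a verified fix: the Maven coordinates, a Neo4j instance reachable at bolt://localhost:7687, and the `countries.csv` file are all assumptions, and the connector version must match your Spark and Scala versions.

```python
from pyspark.sql import SparkSession

# spark.jars.packages must be set before the session is created so Spark
# can resolve the connector from Maven at startup
# (coordinates are an assumption -- verify against the connector's docs)
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .config("spark.jars.packages",
            "org.neo4j:neo4j-connector-apache-spark_2.12:4.0.1_for_spark_3") \
    .getOrCreate()

df = spark.read.csv("countries.csv")

# Write each row as a node with the :Countries label
df.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("ErrorIfExists") \
    .option("url", "bolt://localhost:7687") \
    .option("labels", ":Countries") \
    .save()
```

Note that the JAR name above ends in `_for_spark_3`, while the traceback shows Spark 2.4.0; a connector built for a different Spark major version will not load correctly, so the versions should be aligned in either direction.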