
Connecting pyspark to Neo4j with CSV


I want to feed data into Neo4j using pyspark. I tried the following code:

from pyspark.sql import SparkSession

# Create a local Spark session
spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Load the CSV into a DataFrame
df = spark.read.csv("countries.csv")

# Write the rows to Neo4j as nodes labeled :Countries
df.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("ErrorIfExists") \
    .option("url", "bolt://localhost:7687") \
    .option("labels", ":Countries") \
    .save()
But it gives me this error:

2020-11-11 17:49:19 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2020-11-11 17:49:20 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "xxx", line 13, in <module>
    .option("labels", ":Countries") \
  File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 734, in save
    self._jwrite.save()
  File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o40.save.
: java.lang.ClassNotFoundException: Failed to find data source: org.neo4j.spark.DataSource. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
Can anyone help me get past this problem?
Thanks in advance.

The problem is with your format. You need to include the connector package in your Spark application.

Try something like the following.
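A minimal sketch of one way to do this, assuming the connector is published under the Maven coordinates org.neo4j:neo4j-connector-apache-spark_2.12:4.0.1_for_spark_3 (inferred from the jar name cited below; confirm the exact coordinates against the connector's documentation):

from pyspark.sql import SparkSession

# Pull the connector from Maven when the session starts.
# The coordinates are an assumption and must match your Spark and Scala versions.
spark = SparkSession.builder.master("local[1]") \
    .appName("SparkByExamples.com") \
    .config("spark.jars.packages",
            "org.neo4j:neo4j-connector-apache-spark_2.12:4.0.1_for_spark_3") \
    .getOrCreate()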


You must add the Neo4j connector jar and pass it when launching the Spark shell:

pyspark --jars neo4j-connector-apache-spark_2.12-4.0.1_for_spark_3.jar
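The same flag works with spark-submit if you are running a script rather than the interactive shell; your_script.py below is a placeholder name for the script shown in the question:

spark-submit --jars neo4j-connector-apache-spark_2.12-4.0.1_for_spark_3.jar your_script.py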

You can download this jar from ().

I have included the package. Maybe download the jar file from the website () and run the pyspark or spark-submit command with --jars /path/to/jar.

I downloaded the jar file from the given link and passed it to the spark-submit command, but it still gives me the same error:
pyspark --jars neo4j-connector-apache-spark_2.12-4.0.1_for_spark_3.jar
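One possible reason the error persists, offered as a guess from the traceback rather than anything confirmed in the thread: the paths show Spark 2.4.0 (spark-2.4.0-bin-hadoop2.7), whose prebuilt binaries run on Scala 2.11, while the jar above is built for Scala 2.12 and Spark 3 (the _2.12 and _for_spark_3 parts of the name). A connector that does not match the running Spark and Scala versions cannot be loaded, which produces the same ClassNotFoundException. Assuming the release page also offers a Spark 2.4 / Scala 2.11 build following the same naming scheme, the matching invocation would look like:

pyspark --jars neo4j-connector-apache-spark_2.11-4.0.1_for_spark_2.4.jar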