Apache Spark PySpark Cassandra database connection problem

I want to use Cassandra with PySpark. I can connect to the remote Spark server without problems, but I run into trouble at the stage of reading a Cassandra table. I have tried all of the DataStax connectors and changed the Spark configuration (cores, memory, etc.), but I cannot get it to work. (The commented-out lines in the code below are my attempts.) Here is my Python code:
import os
os.environ['JAVA_HOME'] = r"C:\Program Files\Java\jdk1.8.0_271"
os.environ['HADOOP_HOME'] = r"E:\etc\spark-3.0.1-bin-hadoop2.7"
os.environ['PYSPARK_DRIVER_PYTHON']="/usr/local/bin/python3.7"
os.environ['PYSPARK_PYTHON']="/usr/local/bin/python3.7"
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=XX.XX.XX.XX --conf spark.cassandra.auth.username=username --conf spark.cassandra.auth.password=passwd pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars .ivy2\jars\spark-cassandra-connector-driver_2.12-3.0.0-alpha2.jar pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 pyspark-shell'
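A side note on the PYSPARK_SUBMIT_ARGS attempts above: each configuration property needs its own --conf flag, --packages takes comma-separated Maven coordinates (groupId:artifactId:version), and the string must end with pyspark-shell. A minimal sketch that assembles such a string (the helper name is hypothetical, for illustration only):

```python
def build_submit_args(packages=None, conf=None):
    """Assemble a PYSPARK_SUBMIT_ARGS string (hypothetical helper)."""
    parts = []
    if packages:
        # Maven coordinates, comma-separated after a single --packages flag
        parts.append("--packages " + ",".join(packages))
    for key, value in (conf or {}).items():
        # every property needs its own --conf flag
        parts.append(f"--conf {key}={value}")
    parts.append("pyspark-shell")  # required terminator for PySpark
    return " ".join(parts)

args = build_submit_args(
    packages=["com.datastax.spark:spark-cassandra-connector_2.12:3.0.0"],
    conf={"spark.cassandra.connection.host": "XX.XX.XX.XX"},
)
print(args)
# → --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=XX.XX.XX.XX pyspark-shell
```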
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
conf.set("spark.shuffle.service.enabled", "false")
conf.set("spark.dynamicAllocation.enabled","false")
conf.set("spark.executor.cores", "2")
conf.set("spark.executor.memory", "5g")
conf.set("spark.executor.instances", "1")
conf.set("spark.jars", "C:\\Users\\verianalizi\\.ivy2\\jars\\spark-cassandra-connector_2.12-3.0.0-beta.jar")
conf.set("spark.cassandra.connection.host","XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username","username")
conf.set("spark.cassandra.auth.password","passwd")
conf.set("spark.cassandra.connection.port", "9042")
# conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
sc = SparkContext(conf=conf)
# sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
# It works well until now
def load_and_get_table_df(keys_space_name, table_name):
    table_df = sqlContext.read \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", keys_space_name) \
        .option("table", table_name) \
        .load()
    return table_df
movies = load_and_get_table_df("weather", "currentweatherconditions")
The error I get is:
Does anyone have any idea?

This happens because you specified only the
spark.jars
property and pointed it to a single jar. But the Spark Cassandra Connector depends on a number of other jars that are not included in that list. I recommend either using
spark.jars.packages
with the coordinates com.datastax.spark:spark-cassandra-connector_2.12:3.0.0, or specifying in
spark.jars
the path to an assembly jar that bundles all the necessary dependencies.
By the way, 3.0.0 was released several months ago - why are you still using a beta version?

Thank you for answering my question. I tried the code below and similar variants:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 pyspark-shell'
, but the result is the same. I am running this from Jupyter.

I found the solution, using the information above of course:

conf.set("spark.jars", r"C:\Users\verianalizi\.ivy2\jars\spark-cassandra-connector-assembly_2.12-3.0.0.jar")
conf.set("spark.driver.extraClassPath", r"C:\Users\verianalizi\.ivy2\jars\spark-cassandra-connector-assembly_2.12-3.0.0.jar")
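Putting the accepted fix together: if you stick with a single jar in spark.jars, it has to be the assembly jar (which bundles the connector's transitive dependencies), and the driver classpath should point at it too. A sketch of the relevant properties, collected as a plain dict before applying them to a SparkConf (host, credentials, and the jar path are placeholders taken from the question):

```python
# Spark properties for the working setup (values are placeholders).
assembly_jar = r"C:\Users\verianalizi\.ivy2\jars\spark-cassandra-connector-assembly_2.12-3.0.0.jar"

cassandra_conf = {
    "spark.jars": assembly_jar,                   # assembly bundles all connector deps
    "spark.driver.extraClassPath": assembly_jar,  # make classes visible to the driver
    "spark.cassandra.connection.host": "XX.XX.XX.XX",
    "spark.cassandra.connection.port": "9042",
    "spark.cassandra.auth.username": "username",
    "spark.cassandra.auth.password": "passwd",
}

# Applied to a SparkConf like:
#   for key, value in cassandra_conf.items():
#       conf.set(key, value)
```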