ClassNotFound error when connecting to Snowflake from PySpark on a local machine


I am trying to connect to Snowflake from PySpark on my local machine.

My code looks like this:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *
    from pyspark import SparkConf, SparkContext

    sc = SparkContext("local", "sf_test")
    spark = SQLContext(sc)
    spark_conf = SparkConf().setMaster('local').setAppName('sf_test')

    sfOptions = {
      "sfURL" : "someaccount.some.address",
      "sfAccount" : "someaccount",
      "sfUser" : "someuser",
      "sfPassword" : "somepassword",
      "sfDatabase" : "somedb",
      "sfSchema" : "someschema",
      "sfWarehouse" : "somedw",
      "sfRole" : "somerole",
    }

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
When I run this particular block of code, I get an error:

df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query","""select * from 
 "PRED_ORDER_DEV"."SALES"."V_PosAnalysis" pos 
    ORDER BY pos."SAPAccountNumber", pos."SAPMaterialNumber" """).load()
Py4JJavaError: An error occurred while calling o115.load.
: java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)

I have downloaded the connector and JDBC JAR files and added them to the classpath:

pyspark --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4
CLASSPATH = C:\Program Files\Java\jre1.8.0_241\bin;C:\snowflake_jar

I would like to be able to connect to Snowflake and read data with PySpark. Any help would be much appreciated.

To run a PySpark application you can use spark-submit and pass the JARs with the --packages option. I assume you want to run in client mode, so pass that to the --deploy-mode option, and finally add the name of your PySpark program.

Like this:

spark-submit --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4 --deploy-mode client spark-snowflake.py
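
If you prefer launching the script with a plain python call instead of spark-submit, the same Maven coordinates can also be set on the session builder through spark.jars.packages. This is a minimal sketch, assuming the machine can reach Maven Central so Spark can resolve the packages at startup:

from pyspark.sql import SparkSession

# Let Spark resolve the same coordinates that spark-submit --packages would.
spark = SparkSession.builder \
    .master("local") \
    .appName("sf_test") \
    .config("spark.jars.packages",
            "net.snowflake:snowflake-jdbc:3.11.1,"
            "net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4") \
    .getOrCreate()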
Below is a working script.

You should create a jar directory in the root of your project and add two JARs:

  • snowflake-jdbc-3.13.4.jar (the JDBC driver)
  • spark-snowflake_2.12-2.9.0-spark_3.1.jar (the Spark connector)
Next, you need to know your Scala compiler version. I use PyCharm, so I double-tap Shift and search for "scala"; you will see something like scala-compiler-2.12.10.jar. The first digits of the Scala compiler version (2.12 in this case) must match the first digits in the Spark connector name (spark-snowflake_2.12-2.9.0-spark_3.1.jar).

  • Driver -
  • Connector -
Check your Scala compiler version before downloading the connector.
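
If you are not sure which Scala version your Spark build uses, you can also check it from PySpark itself before picking a connector. This is a minimal sketch; it reaches the JVM through Spark's internal _jvm gateway, so treat it as a convenience check rather than a public API (running pyspark --version prints the same "Using Scala version ..." line):

from pyspark.sql import SparkSession

# Start a throwaway local session and ask the JVM for its Scala version.
spark = SparkSession.builder.master("local").appName("scala-version-check").getOrCreate()
print(spark.sparkContext._jvm.scala.util.Properties.versionString())  # e.g. "version 2.12.10"
spark.stop()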

from pyspark.sql import SparkSession

sfOptions = {
    "sfURL": "sfURL",
    "sfUser": "sfUser",
    "sfPassword": "sfPassword",
    "sfDatabase": "sfDatabase",
    "sfSchema": "sfSchema",
    "sfWarehouse": "sfWarehouse",
    "sfRole": "sfRole",
}

# Register the two local JARs with the session; the jar/ paths are
# resolved relative to the directory the script is launched from.
spark = SparkSession.builder \
    .master("local") \
    .appName("snowflake-test") \
    .config('spark.jars', 'jar/snowflake-jdbc-3.13.4.jar,jar/spark-snowflake_2.12-2.9.0-spark_3.1.jar') \
    .getOrCreate()


SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select * from some_table") \
    .load()

df.show()
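
As a follow-up to the script above, writing a DataFrame back to Snowflake uses the same source name and options; the connector's dbtable option names the target table. This is a minimal sketch: SOME_TARGET_TABLE is a placeholder and must be writable with the role configured in sfOptions.

# Append the rows read above to a Snowflake table (placeholder name).
df.write \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "SOME_TARGET_TABLE") \
    .mode("append") \
    .save()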