Amazon Web Services: Unable to connect to Snowflake from an EMR cluster using PySpark and the Airflow EMR operator

I am trying to connect to Snowflake from an EMR cluster launched by the Airflow EMR operator, but I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o147.load.
: java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake. Please find packages at:

Below are the steps I added to the EmrAddStepsOperator to run the script load_updates.py, with the Snowflake packages listed under "Args".

This is how I add the Snowflake credentials in the load_updates.py script to pull data into a PySpark DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load_updates").getOrCreate()

# Snowflake connection options (values redacted in the original post)
sfOptions = {
  "sfURL" : "xxxx.us-east-1.snowflakecomputing.com",
  "sfUser" : "user",
  "sfPassword" : "xxxx",
  "sfDatabase" : "",
  "sfSchema" : "PUBLIC",
  "sfWarehouse" : ""
}

# Short name of the Snowflake Spark connector data source
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

query_sql = """select * from cf"""

# Push the query down to Snowflake and load the result as a DataFrame
messages_new = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
  .options(**sfOptions) \
  .option("query", query_sql) \
  .load()
Not sure whether I am missing something or where I went wrong.

In the spark-submit command, the --packages option should be placed before s3://…/load_updates.py. Otherwise it is treated as an application argument.

Try this:

steps = [
    {
        "Name": "fleet_facts",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--packages",
                "net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4",
                "s3://dev-data-lake/spark_files/cf/load_updates.py",
                "INPUT=s3://dev-data-lake/table_exports/public/",
                "OUTPUT=s3://dev-data-lake/emr_output/cf/"
            ]
        }
    }
]
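
For completeness, here is a minimal sketch of how a step list like this is typically handed to the Airflow EMR operator. The DAG id, task ids, and aws_conn_id below are assumptions for illustration, not taken from the original post:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator

dag = DAG("emr_snowflake_load", start_date=datetime(2020, 1, 1), schedule_interval=None)

# Submit the step list defined above to a running cluster; the job flow id
# is pulled from XCom of an assumed upstream cluster-creation task.
add_steps = EmrAddStepsOperator(
    task_id="add_emr_steps",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
    aws_conn_id="aws_default",
    steps=steps,
    dag=dag,
)

The contrib import path matches the older Airflow 1.10.x releases that were current with Spark 2.4; newer Airflow versions move this operator into the Amazon provider package.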

Your answer worked, thank you very much. I am also writing data from EMR to a Snowflake table and can see it land, but the data transfer from EMR to Snowflake or S3 takes almost an hour, while the compute part takes barely 3-5 minutes. Wondering whether you have run into something similar and, if so, how you dealt with it?
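
For reference, the write path mentioned in this comment would look roughly like the sketch below with the same connector; the target table name and save mode are illustrative assumptions:

# Hypothetical write-back to Snowflake, reusing sfOptions and
# SNOWFLAKE_SOURCE_NAME from load_updates.py; "CF_UPDATES" and the
# overwrite mode are assumptions for illustration.
messages_new.write.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", "CF_UPDATES") \
    .mode("overwrite") \
    .save()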