Apache Spark: Unable to access environment variables in a PySpark job submitted via Airflow on a Google Dataproc cluster

Tags: apache-spark, pyspark, google-cloud-platform, airflow, google-cloud-dataproc

I am running a PySpark job via Airflow on a Google Dataproc cluster.

The job downloads data from AWS S3 and, after processing, stores it on Google Cloud Storage. So that the executors on Google Dataproc can access the S3 bucket, I store the AWS credentials in environment variables (appended to /etc/environment) while creating the Dataproc cluster through an initialization action.

I use Boto3 to fetch the credentials and then set the Spark configuration:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Resolve AWS credentials through the default boto3 credential chain
# (expected to pick them up from the environment variables).
boto3_session = boto3.Session()
aws_credentials = boto3_session.get_credentials()
aws_credentials = aws_credentials.get_frozen_credentials()
aws_access_key = aws_credentials.access_key
aws_secret_key = aws_credentials.secret_key

# Hand the credentials to the s3a connector via the Hadoop configuration.
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key)
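
For context, a minimal sketch of how the job would then move data from S3 to Google Cloud Storage with this configuration (the bucket names and paths below are hypothetical placeholders, not taken from the actual job):

# Read the source data from S3 through the s3a connector configured above.
raw_df = spark.read.json("s3a://example-source-bucket/logs/2018-07-19/")

# ... processing would happen here ...

# Write the processed output to Google Cloud Storage (the GCS connector is
# preinstalled on Dataproc).
raw_df.write.mode("overwrite").parquet("gs://example-target-bucket/processed/2018-07-19/")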
Initialization action file:

#!/usr/bin/env bash

#This script installs required packages and configures the environment

wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install boto3
sudo pip install google-cloud-storage


echo "AWS_ACCESS_KEY_ID=XXXXXXXXXXXXX" | sudo tee --append /etc/environment
echo "AWS_SECRET_ACCESS_KEY=xXXxXXXXXX" | sudo tee --append /etc/environment

source /etc/environment
But I get the following error, which means my Spark process is not able to pick up the configuration from the environment variables:

18/07/19 22:02:16 INFO org.spark_project.jetty.util.log: Logging initialized @2351ms
18/07/19 22:02:16 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT
18/07/19 22:02:16 INFO org.spark_project.jetty.server.Server: Started @2454ms
18/07/19 22:02:16 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@75b67e54{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/07/19 22:02:16 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.7-hadoop2
18/07/19 22:02:17 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-1-m/10.164.0.2:8032
18/07/19 22:02:19 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1532036330220_0004
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:23 INFO DependencyResolver: ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster-1-m:9083
18/07/19 22:02:23 INFO hive.metastore: Connected to metastore.
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/952f73b3-a59c-4a23-a04a-f05dc4e67d89_resources
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/952f73b3-a59c-4a23-a04a-f05dc4e67d89/_tmp_space.db
18/07/19 22:02:24 INFO DependencyResolver: ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/59c3fef5-6c9e-49c9-bf31-69634430e4e6_resources
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created local directory: /tmp/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6
18/07/19 22:02:24 INFO org.apache.hadoop.hive.ql.session.SessionState: Created HDFS directory: /tmp/hive/root/59c3fef5-6c9e-49c9-bf31-69634430e4e6/_tmp_space.db
Traceback (most recent call last):
  File "/tmp/get_search_query_logs_620dea04/download_elk_event_logs.py", line 237, in <module>
    aws_credentials = aws_credentials.get_frozen_credentials()
AttributeError: 'NoneType' object has no attribute 'get_frozen_credentials'
18/07/19 22:02:24 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@75b67e54{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
When I log in to a Dataproc node and submit the job manually, the Spark job picks up the credentials and runs fine.
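
To confirm whether the variables are even visible to the driver when the job comes in through Airflow, a quick sanity check like this could be added at the top of the job (purely for debugging; not part of the original script):

import os

# If these print None, the values appended to /etc/environment never reached
# the process that runs the PySpark driver, so boto3 has nothing to pick up.
print(os.environ.get("AWS_ACCESS_KEY_ID"))
print(os.environ.get("AWS_SECRET_ACCESS_KEY"))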


Can anyone help me with this?

After fiddling with the Linux environment for a while, nothing I tried would make the boto3 session pick up the AWS credentials from environment variables. So, following the documentation, I modified the initialization action script as follows:

echo "[default]" | sudo tee --append /root/.aws/config
echo "aws_access_key_id = XXXXXXXX" | sudo tee --append /root/.aws/config
echo "aws_secret_access_key = xxxxxxxx" | sudo tee --append /root/.aws/config

There are several ways a boto3_session can find AWS credentials, and one of them is through ~/.aws/config.
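
With that file in place, the credential lookup in the job works without any environment variables; a minimal sketch (assuming the job runs as root, as the /tmp/root paths in the log suggest, so ~/.aws/config resolves to /root/.aws/config):

import boto3

# boto3 now falls back to the shared config file, so get_credentials()
# no longer returns None.
boto3_session = boto3.Session()
aws_credentials = boto3_session.get_credentials().get_frozen_credentials()
aws_access_key = aws_credentials.access_key
aws_secret_key = aws_credentials.secret_key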


How do you add the credentials to /etc/environment, and at what point? It seems that just adding variables to /etc/environment is not enough for systemd services; you need to actually source this file (i.e., explicitly pull the variables out of it). Here are two options for doing that:

@IgorDvorzhak I am adding the environment variables to /etc/environment during cluster creation, in the initialization action.

@IgorDvorzhak: thanks for the help :)