Apache Spark: How to fix "NullPointerException: projectId must not be null" in a Spark application on GKE?

Tags: apache-spark, kubernetes, google-cloud-platform, google-cloud-storage, google-kubernetes-engine

I'm deploying a Spark Structured Streaming application to Google Kubernetes Engine, and I'm hitting the following exception when accessing a bucket through the gs:// URI scheme:

Exception in thread "main" java.lang.NullPointerException: projectId must not be null
    at com.google.cloud.hadoop.repackaged.gcs.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createBucket(GoogleCloudStorageImpl.java:437)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorage.createBucket(GoogleCloudStorage.java:88)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirsInternal(GoogleCloudStorageFileSystem.java:456)
    at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.mkdirs(GoogleCloudStorageFileSystem.java:444)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.mkdirs(GoogleHadoopFileSystemBase.java:911)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:2275)
    at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:137)
    at org.apache.spark.sql.execution.streaming.MicroBatchExecution.<init>(MicroBatchExecution.scala:50)
    at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:317)
    at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:359)
    at org.apache.spark.sql.streaming.DataStreamWriter.startQuery(DataStreamWriter.scala:466)
    at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:456)
    at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:301)
    at meetup.SparkStreamsApp$.delayedEndpoint$meetup$SparkStreamsApp$1(SparkStreamsApp.scala:25)
    at meetup.SparkStreamsApp$delayedInit$body.apply(SparkStreamsApp.scala:7)

How can I fix this in a proper Kubernetes/GKE way?

The approach recommended in the GKE documentation is:

kubectl create secret generic spark-streaming-sa --from-file=/path/spark-streaming-serviceaccount-key.json
When submitting the job, add the following configuration:

--conf spark.kubernetes.driver.secrets.spark-streaming-sa=<mount path>
--conf spark.kubernetes.executor.secrets.spark-streaming-sa=<mount path>
--conf spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS=/spark-streaming-sa.json
--conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/spark-streaming-sa.json
--conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/spark-streaming-sa.json
You can refer to the example available on GitHub.
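Put together, the GKE-recommended settings look like the sketch below. The /etc/secrets mount path and the key-file name are assumptions for illustration, not values from the original answer; the secret created above is mounted into both the driver and executor pods, and the credential properties all point at the key file inside that mount:

```shell
# Sketch only: the mount path (/etc/secrets) and key-file name are assumed.
spark-submit \
  --conf spark.kubernetes.driver.secrets.spark-streaming-sa=/etc/secrets \
  --conf spark.kubernetes.executor.secrets.spark-streaming-sa=/etc/secrets \
  --conf spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS=/etc/secrets/spark-streaming-serviceaccount-key.json \
  --conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/etc/secrets/spark-streaming-serviceaccount-key.json \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/etc/secrets/spark-streaming-serviceaccount-key.json \
  <other options>
```

The same mount path must be used in all three credential properties, since the env vars and the keyfile property are resolved inside the pods where the secret is mounted.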

The Secret Management section of the Spark docs also covers this:

Kubernetes secrets can be used to provide credentials for a Spark application to access secured services. To mount a user-specified secret into the driver container, users can use the configuration property of the form spark.kubernetes.driver.secrets.[SecretName]=<mount path>. Similarly, the configuration property of the form spark.kubernetes.executor.secrets.[SecretName]=<mount path> can be used to mount a user-specified secret into the executor containers.


根据您的配置,我建议您添加以下属性
fs.gs.project.id
,如图所示。因为它显示为
所需。谷歌云项目ID,可访问配置的GCS存储桶
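The value for fs.gs.project.id is the project_id field of the service account key file itself. A quick way to check it (the file below is a fake key containing only the fields relevant here, purely for illustration):

```shell
# Fake service-account key with only the relevant fields (illustration only).
cat > /tmp/spark-streaming-sa.json <<'EOF'
{
  "type": "service_account",
  "project_id": "my-gcp-project",
  "client_email": "spark-streaming-sa@my-gcp-project.iam.gserviceaccount.com"
}
EOF

# Extract project_id; this is the value to pass as spark.hadoop.fs.gs.project.id.
grep -o '"project_id": *"[^"]*"' /tmp/spark-streaming-sa.json | sed 's/.*": *"//; s/"$//'
# prints: my-gcp-project
```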

I also agree with @Blackishop's point about secret management.

./bin/spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --class meetup.SparkStreamsApp \
  --conf spark.kubernetes.driver.request.cores=400m \
  --conf spark.kubernetes.executor.request.cores=100m \
  --conf spark.kubernetes.container.image=$SPARK_IMAGE \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=$K8S_NAMESPACE \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.submission.waitAppCompletion=false \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --verbose \
  local:///opt/spark/jars/meetup.spark-streams-demo-0.1.0.jar $BUCKET_NAME
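Applied to the submit command above, the fix from both answers amounts to adding --conf lines like the following (the mount path, key-file name, and project ID are placeholders, not values from the post):

```shell
  --conf spark.kubernetes.driver.secrets.spark-streaming-sa=/etc/secrets \
  --conf spark.kubernetes.executor.secrets.spark-streaming-sa=/etc/secrets \
  --conf spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS=/etc/secrets/spark-streaming-sa.json \
  --conf spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/etc/secrets/spark-streaming-sa.json \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/etc/secrets/spark-streaming-sa.json \
  --conf spark.hadoop.fs.gs.project.id=my-gcp-project \
```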