Apache Spark: how to properly set up Google Cloud Storage for a Spark application using AWS Data Pipeline


I am setting up a cluster step to run a Spark application with AWS Data Pipeline. The job reads data from S3, processes it, and writes the results to Google Cloud Storage. For Google Cloud Storage I use a service account with a key file. However, the job complains that it cannot find the key file during the write step. I have tried many things, but none of them worked. When the application is launched without Data Pipeline, it runs fine.
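
The job is essentially of this shape (a minimal sketch for illustration; the paths, file format, and transformations are placeholders, not the real ones):

import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MyApp").getOrCreate()

    // Read input from S3 (placeholder path)
    val input = spark.read.parquet("s3://myBucket/input/")

    // ...processing... (stand-in for the real transformations)
    val result = input

    // Write the result to Google Cloud Storage -- this is the step that
    // fails under Data Pipeline, because the GCS connector cannot find the
    // service account key file while initializing the gs:// filesystem
    result.write.parquet("gs://my-output-bucket/output/")

    spark.stop()
  }
}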

Here is what I tried:

google.cloud.auth.service.account.json.keyfile="/home/hadoop/gs_test.json"

google.cloud.auth.service.account.json.keyfile="gs_test.json"

with the corresponding cluster steps, shipping the key file to the nodes with --files (once plainly, once with a # alias):

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json#gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf
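
For reference, when Hadoop filesystem properties like these are passed on the spark-submit command line, they normally need the spark.hadoop. prefix: Spark forwards only spark.* keys into the Hadoop configuration of the driver and executors, and ignores (with a warning) bare google.cloud.* keys given via --conf. A sketch of a step in that form (property names are those of GCS connector 1.x; this is an untested variant, not a confirmed fix):

command-runner.jar,spark-submit,--master,yarn,--deploy-mode,client,--conf,spark.hadoop.google.cloud.auth.service.account.enable=true,--conf,spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/hadoop/gs_test.json,--jars,/home/hadoop/appHelper.jar,--num-executors,5,--executor-cores,3,--executor-memory,6G,--name,MyApp,--files,/home/hadoop/gs_test.json, /home/hadoop/app.jar,s3://myBucket/app.conf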

Here is the error:

java.io.FileNotFoundException: /home/hadoop/gs_test.p12 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:670)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromPrivateKeyServiceAccount(CredentialFactory.java:234)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:90)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:113)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:88)
at org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter.<init>(DirectFileOutputCommitter.java:31)
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputCommitter(FileOutputFormat.java:310)
at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:36)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:146)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:246)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Do you know how to properly set up Google Cloud Storage for a Spark application with AWS Data Pipeline? Any help is greatly appreciated.

If I understand correctly: you want to use GCS (gs:// style URLs) from a Spark job running outside of Dataproc.

In that case, you have to install the GCS connector to make the gs:// URL scheme available:

https://github.com/GoogleCloudDataproc/hadoop-connectors

Installation and setup instructions are in the GitHub link above.
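
Concretely, installation amounts to getting the connector jar onto every node and registering the gs:// scheme with Hadoop. A sketch under the assumption of an EMR cluster and the Hadoop 2 build of the connector (the jar URL below is Google's public distribution; the paths are illustrative):

# e.g. in an EMR bootstrap action, run on every node:
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar \
     -P /usr/lib/hadoop/lib/

The filesystem classes and credentials can then be wired up per job on the spark-submit command line (or equivalently in core-site.xml):

spark-submit \
  --jars /usr/lib/hadoop/lib/gcs-connector-hadoop2-latest.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/home/hadoop/gs_test.json \
  ...

Note that the key file then has to exist at that same path on every node (via a bootstrap action, a baked image, or similar); --files alone only localizes it into each YARN container's working directory, not at /home/hadoop/.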

Did you find a solution? I have the same problem here and would love to hear how you solved it.