Java Spark GCS connector issue - error getting bucket
I want Spark to export data to Google Cloud Storage instead of saving it to HDFS. To achieve this, I installed the GCS connector. Below is sample code, run in the Spark context, that I use to save a DataFrame to a bucket:
val someDF = Seq(
(8, "bat"),
(64, "mouse"),
(-27, null)
).toDF("number", "word")
val conf = sc.hadoopConfiguration
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", PROJECT_ID)
conf.set("fs.gs.auth.service.account.enable", "true")
conf.set("fs.gs.auth.service.account.json.keyfile", LOCATION_TO_KEY) // path to the service-account JSON key file
someDF
.write
.format("parquet")
.mode("overwrite")
.save(s"gs://BUCKET_GLOBAL_IDENTIFIER/A_FOLDER_IN_A_BUCKET/)
After executing the code, I get a rather cryptic exception:
java.io.IOException: Error getting 'BUCKET_GLOBAL_IDENTIFIER' bucket
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$8.onFailure(GoogleCloudStorageImpl.java:1633)
at com.google.cloud.hadoop.gcsio.BatchHelper.execute(BatchHelper.java:183)
at com.google.cloud.hadoop.gcsio.BatchHelper.lambda$queue$0(BatchHelper.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createJsonResponseException(GoogleCloudStorageExceptions.java:82)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$8.onFailure(GoogleCloudStorageImpl.java:1624)
... 6 more
Can anyone give me a clue to solving this? Here is the list of issues I have already resolved to get to this point:
- Spark could not access the key: it was not available on the physical node where Spark was running
- The GCS service account used by the Spark connector did not have permission to create buckets; this was resolved by saving the data into an already-existing bucket
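To grant an existing service account write access to an already-existing bucket (the second issue above), a `gsutil iam ch` command along these lines can be used. The service-account email, project, and bucket name below are hypothetical placeholders, not values from the original post:

```shell
# Grant the connector's service account object read/write access
# on an existing bucket (placeholder names - substitute your own):
gsutil iam ch \
  serviceAccount:spark-gcs@PROJECT_ID.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://BUCKET_GLOBAL_IDENTIFIER
```

`roles/storage.objectAdmin` allows creating, reading, and deleting objects but not creating buckets; if bucket creation is also needed, a broader role such as `roles/storage.admin` would be required.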
The problem was in this line:
save(s"gs://BUCKET_GLOBAL_IDENTIFIER/A_FOLDER_IN_A_BUCKET/)
whereas the correct version is:
save(s"gs://BUCKET_GLOBAL_IDENTIFIER/A_FOLDER_IN_A_BUCKET/")
The mistake was that the placeholders were treated as literal string content rather than as variables, most likely because of the missing closing quote.
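As a related note, Scala's `s` interpolator only substitutes identifiers that are prefixed with `$`; bare names inside the string stay literal text even when the quotes are balanced. A minimal sketch with hypothetical placeholder values (not from the original post):

```scala
object GcsPathDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical placeholder values for illustration only
    val BUCKET_GLOBAL_IDENTIFIER = "my-bucket"
    val A_FOLDER_IN_A_BUCKET = "exports"

    // Note the closing quote before the parenthesis, and the $ prefixes
    // that make the s-interpolator substitute the variables:
    val path = s"gs://$BUCKET_GLOBAL_IDENTIFIER/$A_FOLDER_IN_A_BUCKET/"
    println(path) // prints gs://my-bucket/exports/
  }
}
```

Without the `$` prefixes, the string would contain the identifier names themselves, so Spark would try to resolve a bucket literally named `BUCKET_GLOBAL_IDENTIFIER`, which would also produce an "Error getting bucket" failure.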