GCS connector for Spark issue - Error getting bucket

Tags: java, apache-spark, google-cloud-platform

I want Spark to export data to Google Cloud Storage instead of saving it to HDFS. To achieve this, I installed the GCS connector. Below is the sample code, run in the Spark context, that I use to save a dataframe to a bucket:

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, null)
).toDF("number", "word")

val conf = sc.hadoopConfiguration
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", PROJECT_ID)
conf.set("fs.gs.auth.service.account.enable", "true")
conf.set("fs.gs.auth.service.account.json.keyfile", LOCATION_TO_KEY.json)

someDF
  .write
  .format("parquet")
  .mode("overwrite")
  .save(s"gs://BUCKET_GLOBAL_IDENTIFIER/A_FOLDER_IN_A_BUCKET/)
After executing the code, I receive a rather cryptic exception:

java.io.IOException: Error getting 'BUCKET_GLOBAL_IDENTIFIER' bucket
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$8.onFailure(GoogleCloudStorageImpl.java:1633)
  at com.google.cloud.hadoop.gcsio.BatchHelper.execute(BatchHelper.java:183)
  at com.google.cloud.hadoop.gcsio.BatchHelper.lambda$queue$0(BatchHelper.java:163)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createJsonResponseException(GoogleCloudStorageExceptions.java:82)
  at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$8.onFailure(GoogleCloudStorageImpl.java:1624)
  ... 6 more
Can anyone give me a clue about how to solve this? Here is the list of issues I have already resolved to get to this point:

  • Spark could not access the key. The problem was that the key file was not available on the physical node where Spark was running.
  • The GCS service account used by the Spark connector did not have permission to create buckets. This was resolved by saving the data to an already existing bucket. (A standalone access check like the sketch below can help rule out both of these.)
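
To separate credential and bucket problems from the Spark write itself, the bucket can be probed directly through the Hadoop FileSystem API that the connector plugs into. A minimal sketch (not from the original post; the project id, bucket name, and key path are placeholder assumptions):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder values mirroring the question's code -- substitute your own.
val PROJECT_ID = "my-gcp-project"
val BUCKET_GLOBAL_IDENTIFIER = "my-bucket"

val conf = new Configuration()
// fs.gs.impl registers the connector for plain FileSystem lookups;
// fs.AbstractFileSystem.gs.impl covers the newer FileContext API.
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", PROJECT_ID)
conf.set("fs.gs.auth.service.account.enable", "true")
conf.set("fs.gs.auth.service.account.json.keyfile", "/path/to/key.json")

// If the credentials or the bucket are the problem, this call fails with a
// similar storage error, independently of any Spark job.
val fs = FileSystem.get(new URI(s"gs://$BUCKET_GLOBAL_IDENTIFIER/"), conf)
println(fs.exists(new Path(s"gs://$BUCKET_GLOBAL_IDENTIFIER/")))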

You seem to be missing the double quote at the end of the save line.

You have:

save(s"gs://BUCKET_GLOBAL_IDENTIFIER/A_FOLDER_IN_A_BUCKET/)
But the correct version is:

save(s"gs://BUCKET_GLOBAL_IDENTIFIER/A_FOLDER_IN_A_BUCKET/")
The telltale sign is that the exception reports the placeholder name 'BUCKET_GLOBAL_IDENTIFIER' verbatim: the name was treated as literal text rather than a variable, most likely because of the missing quote. Note also that Scala's s-interpolator only substitutes a variable when it is prefixed with $, so even with the quote fixed, the names in the path need $ prefixes to be interpolated.
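
For completeness, a sketch of the corrected write (assuming BUCKET_GLOBAL_IDENTIFIER and A_FOLDER_IN_A_BUCKET are Scala vals that should actually be substituted; the values below are hypothetical):

// Close the string literal, and prefix the vals with `$` so the
// s-interpolator substitutes them instead of writing the names verbatim.
val BUCKET_GLOBAL_IDENTIFIER = "my-bucket" // hypothetical value
val A_FOLDER_IN_A_BUCKET = "exports"       // hypothetical value

someDF
  .write
  .format("parquet")
  .mode("overwrite")
  .save(s"gs://$BUCKET_GLOBAL_IDENTIFIER/$A_FOLDER_IN_A_BUCKET/")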