Google Cloud Platform: how to write to Google Cloud Storage with PySpark writeStream?


I am trying to write from a PySpark stream to GCP storage.
The code is as follows:

df_test \
    .writeStream.format("parquet") \
    .option("path", "gs://{my_bucketname}/test") \
    .option("checkpointLocation", "gs://{my_checkpointBucket}/checkpoint") \
    .start() \
    .awaitTermination()
But I get this error:

20/11/15 16:37:59 WARN CheckpointFileManager: Could not use FileContext API
for managing Structured Streaming checkpoint files at
gs://name-bucket/test/_spark_metadata.
Using FileSystem API instead for managing log files. If the implementation
of FileSystem.rename() is not atomic, then the correctness and
fault-tolerance of your Structured Streaming is not guaranteed.
Traceback (most recent call last):
  File "testgcp.py", line 40, in <module>
    .option("checkpointLocation", "gs://check_point_bucket/checkpoint")\
  File "/home/naya/anaconda3/lib/python3.6/site-packages/pyspark/sql/streaming.py", line 1105, in start
    return self._sq(self._jwrite.start())
  File "/home/naya/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/naya/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/naya/anaconda3/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o55.start.
: java.io.IOException: No FileSystem for scheme: gs
What should the correct syntax be?

It seems that you first need to configure spark.conf with the correct authentication:

spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.email", "your_service_email")
spark.conf.set("google.cloud.auth.service.account.keyfile", "path/to/your/files")
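
Note that these are Hadoop properties read by the Cloud Storage connector, so the connector jar itself must be on Spark's classpath; the java.io.IOException: No FileSystem for scheme: gs in your traceback suggests it is not. If your service-account key is a JSON file, the property is typically google.cloud.auth.service.account.json.keyfile instead (see the sketch at the end of this answer).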
Then you can access files in the bucket with the read function:

df = spark.read.option("header", True).csv("gs://bucket_name/path_to_your_file.csv")
df.show()
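
Putting it together for the streaming write, here is a minimal sketch. It assumes the GCS connector jar (com.google.cloud.bigdataoss:gcs-connector) can be fetched via spark.jars.packages, the bucket names and keyfile path are placeholders to replace, and a toy rate source stands in for your df_test:

from pyspark.sql import SparkSession

# Build a session with the GCS connector on the classpath; without it
# Spark cannot resolve gs:// paths ("No FileSystem for scheme: gs").
# The connector version below is illustrative; pick one that matches
# your Hadoop version.
spark = (
    SparkSession.builder
    .appName("gcs-streaming-sketch")
    .config("spark.jars.packages",
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.5")
    # Register the filesystem implementations for the gs:// scheme.
    # The AbstractFileSystem entry also enables the FileContext API
    # mentioned in the checkpoint warning above.
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    # Service-account authentication, as in the answer (JSON-key variant).
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/keyfile.json")
    .getOrCreate()
)

# A toy streaming source standing in for df_test from the question.
df_test = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# The original writeStream call, unchanged apart from the placeholders.
df_test.writeStream.format("parquet") \
    .option("path", "gs://my_bucketname/test") \
    .option("checkpointLocation", "gs://my_checkpointBucket/checkpoint") \
    .start() \
    .awaitTermination()

Alternatively, the connector can be supplied on the command line with spark-submit --packages, or as a local jar with --jars.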