PySpark: reading Google bucket data in Spark

pyspark google-cloud-platform google-cloud-storage

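I followed this blog post to read data stored in a Google bucket, and it worked fine. The following command

hadoop fs -ls gs://<bucket-to-list>

gave me the expected result. But when I try to read the data with PySpark using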

rdd = sc.textFile("gs://crawl_tld_bucket/")

it throws an error.

How can I get this working?

To access Google Cloud Storage from Spark, you have to include the Cloud Storage connector:

spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py

or

pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
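
If you would rather set this up inside the script than on the command line, the same connector jar can be passed through Spark configuration. The following is only a minimal sketch under some assumptions: the helper names `gcs_spark_conf`, `build_session` and `read_text` are made up for illustration, the jar path is the placeholder from the commands above, and the two `fs.gs` entries name the connector's Hadoop filesystem classes, which may already be registered automatically depending on your connector version. Authentication is not covered here and has to be configured separately.

```python
def gcs_spark_conf(connector_jar):
    """Return Spark config entries that put the GCS connector on the classpath.

    connector_jar is a local path to the connector jar (placeholder path in
    the example below, same as in the spark-submit command).
    """
    return {
        # Equivalent of passing --jars on the command line.
        "spark.jars": connector_jar,
        # Tell Hadoop which classes handle the gs:// scheme.
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    }


def build_session(conf, app_name="gcs-read"):
    """Create a SparkSession with the given config entries applied."""
    # Imported here so the config helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def read_text(spark, path):
    """Read a gs:// path as an RDD of lines, as in the question."""
    return spark.sparkContext.textFile(path)
```

A script would then call `build_session(gcs_spark_conf("/path/to/gcs/gcs-connector-latest-hadoop2.jar"))` and pass the resulting session to `read_text(spark, "gs://crawl_tld_bucket/")`. Note that `spark.jars` only takes effect if it is set before the session (and its JVM) is first created.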