PySpark throws ClassNotFoundException when reading from AWS S3 in a Google Colab environment


Google Colab is a great tool for experimenting with Python, data mining, and deep learning, and I want to run Spark jobs on it with pyspark. When reading from S3 in a Google Colab pyspark script, I hit the following error:

/usr/local/lib/python3.6/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o29.json.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
... 25 more
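
For context, this is roughly the kind of call that produces the traceback (a minimal sketch; the bucket path is the same placeholder used later in this post):

    from pyspark.sql import SparkSession

    # a stock pyspark session in Colab, with no hadoop-aws / aws-java-sdk jars on the classpath
    spark = SparkSession.builder.appName("pyspark-test").getOrCreate()

    # any s3a:// read then fails with ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
    df = spark.read.json('s3a://path/test*.json')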
  • Create a new notebook and install pyspark (see the pip sketch after the wget commands below)
  • First download the AWS-related jars into the pyspark jars folder. Here
    /usr/local/lib/python3.6/dist-packages
    is my Python site-packages folder; you can find yours with
    import site; site.getsitepackages()

    ! cd /usr/local/lib/python3.6/dist-packages/pyspark/jars && wget 'https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar'
    ! cd /usr/local/lib/python3.6/dist-packages/pyspark/jars && wget 'https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar'
    ! cd /usr/local/lib/python3.6/dist-packages/pyspark/jars && wget 'https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar'
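
For completeness, step 1 ("install pyspark") is normally just a pip install in the notebook, and the site-packages path used in the wget commands above can be printed from Python (a sketch, assuming a plain pip-installed pyspark):

    # install pyspark into the Colab runtime
    !pip install pyspark

    # locate the pyspark jars folder under site-packages
    import site, os
    print(os.path.join(site.getsitepackages()[0], "pyspark", "jars"))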
    
    
  • Configure the pyspark session and context

    SPARK_HOME = '/usr/local/lib/python3.6/dist-packages/pyspark/'
    
    spark = SparkSession.\
            builder.\
            appName("pyspark-test").\
            config("spark.driver.extraClassPath", "{0}/jars/hadoop-aws-2.7.4.jar:{0}/jars/aws-java-sdk-1.7.4.jar".format(SPARK_HOME)).\
            config("spark.executor.extraClassPath", "{0}/jars/hadoop-aws-2.7.4.jar:{0}/jars/aws-java-sdk-1.7.4.jar".format(SPARK_HOME)).\
            getOrCreate()
    
    # s3 configuration (I'm actually using DigitalOcean Spaces, an S3-compatible alternative)
    spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "${access_key}")
    spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "${secret_key}")
    spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "${endpoint}")
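
As an aside, instead of downloading the jars by hand and wiring spark.driver.extraClassPath / spark.executor.extraClassPath, Spark can also resolve the S3A connector from Maven Central at session start via spark.jars.packages. This is a sketch under that assumption; the versions mirror the jars above and should match the Hadoop version bundled with your pyspark install:

    from pyspark.sql import SparkSession

    # sketch: let Spark pull hadoop-aws and the matching aws-java-sdk from Maven Central
    # instead of wget'ing them into the jars folder
    spark = SparkSession.\
            builder.\
            appName("pyspark-test").\
            config("spark.jars.packages",
                   "org.apache.hadoop:hadoop-aws:2.7.4,com.amazonaws:aws-java-sdk:1.7.4").\
            getOrCreate()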
    
  • Read from s3

    # read from the s3 server
    df = spark.read.json('s3a://path/test*.json')
    df.show()

    Output

    +---+---+----+
    |  A|  B|   C|
    +---+---+----+
    |  1|  2|null|
    |  1|  2|   3|
    |  1|  2|null|
    |  1|  2|null|
    +---+---+----+
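
One last note: the ${access_key}, ${secret_key} and ${endpoint} placeholders above need real values. To keep them out of the notebook, you can read them from environment variables instead (a sketch; the variable names are my own):

    import os

    # sketch: pull the credentials from environment variables rather than hardcoding them
    spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ["S3_ACCESS_KEY"])
    spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ["S3_SECRET_KEY"])
    spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", os.environ["S3_ENDPOINT"])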