
Apache Spark: reading a file from AWS S3 with PySpark does not work


I installed Spark and Hadoop using brew:

brew info hadoop #=> hadoop: stable 3.1.2
brew info apache-spark #=> apache-spark: stable 2.4.4
I am now trying to load a CSV file hosted on S3. I have tried many different approaches, none of which worked (here is one of them):
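The snippet itself was lost from this page; judging from line 51 of the traceback below, the attempt presumably looked roughly like this (a sketch only: the variable name `sql` and the session setup are assumptions inferred from the traceback, and the bucket path is the one shown there):

```python
from pyspark.sql import SparkSession

# Session setup is assumed; only the read call below appears in the traceback.
sql = SparkSession.builder.appName("s3-csv-read").getOrCreate()

# The failing call, as shown at aws_s3.py line 51 of the traceback.
df = sql.read.csv('s3a://pilo/fi/data_2014_1.csv')
```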

I get this error:

19/09/17 14:34:52 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/ingestion/clients/aws_s3.py", line 51, in <module>
    df = sql.read.csv('s3a://pilo/fi/data_2014_1.csv')
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/pyspark/sql/readwriter.py", line 476, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/cyrusghazanfar/Desktop/startup-studio/pilota_project/pilota_ml/env/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:355)
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:618)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
        ... 30 more
It looks like this is related to my AWS S3 credentials, but I am not sure how to set them up. (Currently my AWS credentials are in my bash_profile.) Please help.

Perhaps this solution by isprabin will help.

Add the following to the file hadoop/etc/hadoop/core-site.xml:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>***</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>***</value>
</property>

Then copy the AWS jars into Hadoop's common classpath:

sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/

Thank you. I believe Hadoop is under /usr/local/Cellar/hadoop/3.1.2, and the configuration files are under /usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop/.
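Note that the failing path uses the s3a:// scheme, whose credential keys (fs.s3a.access.key, fs.s3a.secret.key) differ from the fs.s3.* ones above. A commonly used alternative to copying jars by hand — a sketch under assumptions, not verified against this exact setup — is to pull a matching hadoop-aws artifact and pass the s3a credentials when launching PySpark:

```shell
# Hedged sketch: let Spark download hadoop-aws (and its bundled AWS SDK)
# at launch time. The version must match the Hadoop libraries your Spark
# build ships with, not necessarily the brew-installed Hadoop.
pyspark \
  --packages org.apache.hadoop:hadoop-aws:2.7.5 \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY"
```

The brew build of Spark 2.4.4 is compiled against Hadoop 2.7, so hadoop-aws:2.7.x is the usual match; mixing it with the separately installed Hadoop 3.1.2 jars can itself cause class-path conflicts of exactly the ClassNotFoundException kind shown above.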