
Hadoop: Unable to access files in HDFS via PySpark


I'm new to Spark and Hadoop. I'm trying to set up an EC2 cluster with Spark 2.0.

I copied a file into the ephemeral HDFS and can see that it is there.

Here is the Python code I am submitting:

import sys

import numpy as np
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("MatrixMult")\
        .getOrCreate()

    df = spark.read.option("header","true").csv("hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000/root/input.csv")

    df.show(10)

    spark.stop()  # note: SparkSession exposes stop(), not close()

My Hadoop core-site.xml has the following settings:

<property>
  <name>fs.default.name</name>
  <value>hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000</value>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000</value>
</property>
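
As an aside, one quick way to confirm which filesystem the Spark driver actually resolves is to read the value back from the live Hadoop configuration. This is a diagnostic sketch, not from the original post, and it reaches through PySpark's private _jsc handle:

# Diagnostic only: _jsc is a private PySpark attribute.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))  # expect the hdfs://...:9000 value above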

Here is the error I get when I submit the job:

Traceback (most recent call last):
  File "/root/python_code/matrix_mult.py", line 12, in <module>
    df = spark.read.option("header","true").csv("hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000/root/input.csv")
  File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 380, in csv
  File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.csv.
: java.io.IOException: Failed on local exception: java.io.IOException: Broken pipe; Host Details : local host is: "ip-172-31-58-53.ec2.internal/172.31.58.53"; destination host is: "ec2-54-144-193-191.compute-1.amazonaws.com":9000; 
...

Any idea why this is happening? Any tips on how to debug it? I tried using the internal hostname as well, but that didn't work either. Thanks in advance.
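
One basic check (a hypothetical sketch, not part of the original question) is to verify that the NameNode port is reachable at all from the driver machine; if the socket connects but reads still die with a broken pipe, the problem is more likely a client/server mismatch than the network:

# Hypothetical connectivity probe for the NameNode RPC port.
import socket

try:
    sock = socket.create_connection(
        ("ec2-54-144-193-191.compute-1.amazonaws.com", 9000), timeout=5)
    print("NameNode port is reachable")
    sock.close()
except socket.error as exc:
    print("Cannot reach NameNode:", exc)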

I think you only need to set fs.defaultFS (or the older fs.default.name). My core-site.xml is configured as:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
    </property>
</configuration>
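
With fs.defaultFS pointing at the NameNode, applications no longer need to hard-code the host and port in every path. For example (a sketch assuming the configuration above is in effect):

# An unqualified absolute path now resolves against the default filesystem.
df = spark.read.option("header", "true").csv("/root/input.csv")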


The reason was silly: I was using the prebuilt binaries downloaded from Apache, which expect you to have Hadoop 2. When running the EC2 scripts, you have to pass the flag --hadoop-major-version=2, and I hadn't done that.

I rebuilt the cluster with that flag and it solved the problem.
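
For reference, a spark-ec2 launch command with that flag looks roughly like this (the key pair, identity file, and cluster name are placeholders):

./spark-ec2 -k <keypair> -i <key-file> --hadoop-major-version=2 launch <cluster-name>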

Do you have #!/usr/bin/python at the top?