Hadoop: using HDFS and Apache Spark on Amazon EC2

Tags: hadoop, amazon-web-services, amazon-ec2, apache-spark, hdfs

I set up a Spark cluster using the spark-ec2 script, and now I am trying to put a file onto HDFS so that I can get the cluster doing some real work.

On my master machine I have a file data.txt. I put it onto HDFS by running ephemeral-hdfs/bin/hadoop fs -put data.txt /data.txt.
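
For reference, a roughly equivalent way to do the same copy programmatically is the Hadoop FileSystem API; this is only a sketch, and the namenode hostname below is a placeholder:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; on spark-ec2 clusters the ephemeral-hdfs namenode listens on port 9000
        FileSystem fs = FileSystem.get(URI.create("hdfs://ec2-xxx.amazonaws.com:9000"), conf);
        // Same effect as "hadoop fs -put data.txt /data.txt"
        fs.copyFromLocalFile(new Path("data.txt"), new Path("/data.txt"));
        fs.close();
    }
}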

Now, in my code, I have:

JavaRDD<String> rdd = sc.textFile("hdfs://data.txt",8);

I only end up with one executor, and the worker nodes in the cluster do not seem to be doing anything. I think this is because I was using a local file, and EC2 does not have NFS.

What comes after hdfs:// in hdfs://data.txt is interpreted as the namenode hostname, so it should be hdfs://{active_master}:9000/data.txt (in case it is useful in the future, the default port the spark-ec2 scripts use for persistent-hdfs is 9010).
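
As a self-contained sketch of the corrected read (the class name is made up, and ec2-xxx.amazonaws.com stands in for the active master's hostname, matching the placeholder in the spark-submit command below):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("HdfsReadSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Fully qualified HDFS URI: namenode hostname and port, then the absolute path of the file
        JavaRDD<String> rdd = sc.textFile("hdfs://ec2-xxx.amazonaws.com:9000/data.txt", 8);
        System.out.println("Line count: " + rdd.count());
        sc.stop();
    }
}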

AWS Elastic MapReduce now supports Spark natively and includes HDFS out of the box.

See the details and walkthrough in:

Spark on EMR uses EMRFS to access data in S3 directly, without first having to copy it into HDFS.

The walkthrough includes an example of loading data from S3.
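
As a rough sketch of what that looks like on EMR (the bucket and key names are made-up placeholders), reading from S3 is just a matter of using an s3:// URI:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3ReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("S3ReadSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // On EMR, EMRFS resolves s3:// paths, so no copy into HDFS is needed first
        JavaRDD<String> lines = sc.textFile("s3://my-bucket/data.txt");
        System.out.println("Line count: " + lines.count());
        sc.stop();
    }
}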

With the hdfs://data.txt path I get the following exception:

Exception in thread "main" java.net.UnknownHostException: unknown host: data.txt
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
    at org.apache.hadoop.ipc.Client.call(Client.java:1050)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at com.sun.proxy.$Proxy6.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:123)
    at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:62)
    at org.apache.spark.rdd.RDD.sortBy(RDD.scala:488)
    at org.apache.spark.api.java.JavaRDD.sortBy(JavaRDD.scala:188)
    at SimpleApp.sortBy(SimpleApp.java:118)
    at SimpleApp.main(SimpleApp.java:30)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The local-file version I had been using reads:

JavaRDD<String> rdd = sc.textFile("/home/ec2-user/data.txt",8);

and I submit the application with:

./spark/bin/spark-submit --class SimpleApp --master spark://ec2-xxx.amazonaws.com:7077 --total-executor-cores 8 /home/ec2-user/simple-project-1.0.jar
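
One note on that local-file variant: when Spark is given a plain filesystem path like this, the file has to be accessible at the same path on every worker node, not just on the master, which is exactly what is awkward on EC2 without NFS. The explicit form of the same read (sketch only) is:

// The file must exist at this path on the master and on every worker node
JavaRDD<String> rdd = sc.textFile("file:///home/ec2-user/data.txt", 8);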