
Java: How to read a file from HDFS using Spark?


I have built a recommendation system using Apache Spark, with the dataset stored in my project folder. Now I need to access those files from HDFS.

How can I read files from HDFS using Spark?

Here is how I initialize the Spark session:

SparkContext context = new SparkContext(new SparkConf().setAppName("spark-ml").setMaster("local")
                .set("fs.default.name", "hdfs://localhost:54310").set("fs.defaultFS", "hdfs://localhost:54310"));
        Configuration conf = context.hadoopConfiguration();
        conf.addResource(new Path("/usr/local/hadoop-3.1.2/etc/hadoop/core-site.xml"));
        conf.addResource(new Path("/usr/local/hadoop-3.1.2/etc/hadoop/hdfs-site.xml"));
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.fs.LocalFileSystem");
        this.session = SparkSession.builder().sparkContext(context).getOrCreate();
        System.out.println(conf.getRaw("fs.default.name"));
        System.out.println(context.getConf().get("fs.defaultFS"));
Both print statements output hdfs://localhost:54310, which is the correct URI of my HDFS.

When I try to read a file from HDFS:

session.read().option("header", true).option("inferSchema", true).csv("hdfs://localhost:54310/recommendation_system/movies/ratings.csv").cache();
I get this error:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:54310/recommendation_system/movies/ratings.csv, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:730)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:636)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
    at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:65)
    at org.apache.hadoop.fs.Globber.doGlob(Globber.java:281)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:253)
    at scala.Option.getOrElse(Option.scala:138)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:253)
    at scala.Option.getOrElse(Option.scala:138)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:361)
    at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:360)
    at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
    at com.dastamn.sparkml.analytics.SparkManager.<init>(SparkManager.java:36)
    at com.dastamn.sparkml.Main.main(Main.java:22)

What can I do to fix this?

A few points about the pasted code snippet:

1. When a Hadoop property has to be set as part of the SparkConf, it must be prefixed with spark.hadoop.. In this case the key fs.default.name needs to be set as spark.hadoop.fs.default.name, and likewise for the other properties.
2. The argument to the csv function does not need to include the HDFS endpoint; Spark will figure it out from the default filesystem property, since it is already configured:

session.read().option("header", true).option("inferSchema", true).csv("/recommendation_system/movies/ratings.csv").cache();
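
Going back to point 1, here is a minimal sketch (not from the original answer) of how the spark.hadoop. prefix behaves: any SparkConf key starting with spark.hadoop. is copied into the SparkContext's Hadoop Configuration with the prefix stripped.

SparkConf sparkConf = new SparkConf()
        .setAppName("spark-ml")
        .setMaster("local[*]")
        // the spark.hadoop. prefix marks this as a Hadoop property
        .set("spark.hadoop.fs.defaultFS", "hdfs://localhost:54310");
SparkContext sc = new SparkContext(sparkConf);
// The prefix is stripped when the Hadoop Configuration is built,
// so this should print hdfs://localhost:54310
System.out.println(sc.hadoopConfiguration().get("fs.defaultFS"));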

If the default filesystem property were not part of the Hadoop configuration, Spark/Hadoop would need the full URI to determine which filesystem to use. (Also, the Configuration object named conf is never actually used.)

3. In the case above, it looks like Hadoop could not find a FileSystem implementation for the hdfs:// URI prefix and fell back to the default local filesystem, since it is using RawLocalFileSystem to process the path. Make sure hadoop-hdfs.jar (which contains DistributedFileSystem) is on the classpath so that the FS object for hdfs can be instantiated; a quick way to verify this is sketched below.
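
A hedged sketch (not from the original answer; the URI and port are taken from the question) showing how to check which FileSystem implementation resolves for the hdfs scheme, using the standard Hadoop FileSystem API:

org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration();
org.apache.hadoop.fs.FileSystem fs = org.apache.hadoop.fs.FileSystem.get(
        java.net.URI.create("hdfs://localhost:54310/"), hadoopConf);
// Expected output: org.apache.hadoop.hdfs.DistributedFileSystem.
// A LocalFileSystem here, or a "No FileSystem for scheme: hdfs" error,
// means hadoop-hdfs is not on the classpath or fs.hdfs.impl was overridden.
System.out.println(fs.getClass().getName());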

Here is the configuration that solved the problem:

SparkContext context = new SparkContext(new SparkConf().setAppName("spark-ml").setMaster("local[*]")
                .set("spark.hadoop.fs.default.name", "hdfs://localhost:54310").set("spark.hadoop.fs.defaultFS", "hdfs://localhost:54310")
                .set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
                .set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
                .set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName()));
        this.session = SparkSession.builder().sparkContext(context).getOrCreate();
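
With the default filesystem configured this way, the read from point 2 works with a scheme-less path. A short usage sketch (assuming the usual org.apache.spark.sql.Dataset and Row imports; the path is the one from the question):

Dataset<Row> ratings = session.read()
        .option("header", true)
        .option("inferSchema", true)
        // no hdfs:// prefix needed: the path resolves against fs.defaultFS
        .csv("/recommendation_system/movies/ratings.csv")
        .cache();
ratings.show(5);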

What version of Spark are you using? I'm using Spark 2.4.2 and Hadoop 3.1.0.