Reading a local Windows file in Apache Spark from Eclipse


I am trying to use Spark locally. My environment is:

  • Eclipse Luna with prebuilt Scala support
  • Created a project, converted it to Maven, and added the Spark core dependency jar
  • Downloaded winutils.exe and set the HADOOP_HOME path

The code I am trying to run is:

    import org.apache.spark.{SparkConf, SparkContext}

    object HelloWorld {
      def main(args: Array[String]): Unit = {
        println("Hello, world!")
        /* val master = args.length match {
             case x: Int if x > 0 => args(0)
             case _ => "local"
           } */
        /* val sc = new SparkContext(master, "BasicMap", System.getenv("SPARK_HOME")) */
        val conf = new SparkConf()
          .setAppName("HelloWorld")
          .setMaster("local[2]")
          .set("spark.executor.memory", "1g")
        val sc = new SparkContext(conf)
        val input = sc.textFile("C://Users//user name//Downloads//error.txt")
        // Split each line into words.
        val words = input.flatMap(line => line.split(" "))
        // Transform into (word, 1) pairs and count.
        val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
        counts.foreach(println)
      }
    }
But when I read the file with the SparkContext, it fails with the following error:

    Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/Downloads/error.txt
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
    at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289)
    at com.examples.HelloWorld$.main(HelloWorld.scala:23)
    at com.examples.HelloWorld.main(HelloWorld.scala)
    

Can someone tell me how to get past this error?

The problem was that the user name contains a space, which caused all the trouble. Once I moved to a file path without spaces, it worked fine.
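
A minimal sketch of that fix, assuming the file has been copied to a space-free location such as C:/data/error.txt (a hypothetical path):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
        val sc = new SparkContext(conf)
        // Space-free path (hypothetical location); single forward slashes work on Windows.
        val input = sc.textFile("C:/data/error.txt")
        val counts = input
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.foreach(println)
        sc.stop()
      }
    }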

I got this working on Windows 10 with Spark 2 by adding .config("spark.sql.warehouse.dir", "file:///") to SparkSession.builder(), as well as using \\ in the path.
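
A minimal sketch of that setup, assuming Spark 2.x on Windows 10 (C:\\data\\error.txt is a hypothetical path):

    import org.apache.spark.sql.SparkSession

    object WordCountSpark2 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("WordCountSpark2")
          .master("local[2]")
          // Warehouse-dir workaround for Spark 2.x on Windows
          .config("spark.sql.warehouse.dir", "file:///")
          .getOrCreate()

        // Windows path written with escaped backslashes (hypothetical location)
        val input = spark.sparkContext.textFile("C:\\data\\error.txt")
        val counts = input
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.foreach(println)
        spark.stop()
      }
    }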


PS: make sure you include the file's full extension in the path.
Do you have Cygwin?
@user52045 No, I don't have Cygwin.
I'm fairly sure you need it.
I searched online, and many people advise against installing Cygwin for Spark; instead they recommend using winutils.exe.
Btw, the slashes in your file path are off: with Linux-style forward slashes you only need a single one, and you only need to escape the Windows backslash.
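
A common way to use winutils.exe without Cygwin, sketched under the assumption that winutils.exe sits in C:\hadoop\bin (a hypothetical location):

    import org.apache.spark.{SparkConf, SparkContext}

    object WinutilsSetup {
      def main(args: Array[String]): Unit = {
        // Point Hadoop at the folder whose bin\ subfolder contains winutils.exe
        // (hypothetical location; alternatively set the HADOOP_HOME environment variable).
        System.setProperty("hadoop.home.dir", "C:\\hadoop")

        val sc = new SparkContext(
          new SparkConf().setAppName("WinutilsSetup").setMaster("local[2]"))
        // Hypothetical space-free file, using single forward slashes per the comment above.
        println(sc.textFile("C:/data/error.txt").count())
        sc.stop()
      }
    }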