
Scala function returning an empty list in Spark


Below is the code that gets the list of file names inside a zip file:

def getListOfFilesInRepo(zipFileRDD : RDD[(String,PortableDataStream)]) : (List[String]) = {
    val zipInputStream = zipFileRDD.values.map(x => new ZipInputStream(x.open))
    val filesInZip =  new ArrayBuffer[String]()
    var ze : Option[ZipEntry] = None
    zipInputStream.foreach(stream =>{
      do{
        ze = Option(stream.getNextEntry);
        ze.foreach{ze =>
          if(ze.getName.endsWith("java") && !ze.isDirectory()){
            var fileName:String = ze.getName.substring(ze.getName.lastIndexOf("/")+1,ze.getName.indexOf(".java"))
            filesInZip += fileName
          }
        }
        stream.closeEntry()
      } while(ze.isDefined)
      println(filesInZip.toList.length) // print 889 (correct)
    })
    println(filesInZip.toList.length) // print 0 (WHY..?)
    (filesInZip.toList)
  }
I execute the above code as follows:

scala> val zipFileRDD = sc.binaryFiles("./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip")
zipFileRDD: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = ./handsOn/repo~apache~storm~14135470~false~Java~master~2210.zip BinaryFileRDD[17] at binaryFiles at <console>:25

scala> getListOfFilesInRepo(zipRDD)
889
0
res12: List[String] = List()

Why do I get 0 instead of 889?

This happens because filesInZip is not shared between the workers. foreach operates on a local copy of filesInZip, and when it finishes, that copy is discarded and garbage-collected. If you want to keep the results, you should use a transformation (most likely flatMap) and return the collected, aggregated values:

def listFiles(stream: PortableDataStream): TraversableOnce[String] = ???

zipInputStream.flatMap(listFiles)
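A minimal, self-contained sketch of what listFiles might look like (this is a hypothetical implementation; the in-memory zip below exists only to exercise it, and it extracts base names of .java entries the same way the question's code does):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Hypothetical listFiles: build the result with pure iterator operations
// instead of mutating a shared buffer, so it composes with flatMap.
def listFiles(in: InputStream): List[String] = {
  val zis = new ZipInputStream(in)
  Iterator
    .continually(Option(zis.getNextEntry)) // read entries until getNextEntry returns null
    .takeWhile(_.isDefined)
    .flatten
    .collect {
      case ze if !ze.isDirectory && ze.getName.endsWith(".java") =>
        val name = ze.getName
        // keep only the base name without the ".java" extension
        name.substring(name.lastIndexOf("/") + 1, name.indexOf(".java"))
    }
    .toList
}

// Build a small in-memory zip purely for illustration.
val bytes = {
  val buf = new ByteArrayOutputStream()
  val zos = new ZipOutputStream(buf)
  for (entry <- Seq("src/Main.java", "src/Util.java", "README.md")) {
    zos.putNextEntry(new ZipEntry(entry))
    zos.closeEntry()
  }
  zos.close()
  buf.toByteArray
}

val names = listFiles(new ByteArrayInputStream(bytes))
println(names) // List(Main, Util)
```

In the original function, this helper might then be used as zipFileRDD.values.map(_.open).flatMap(listFiles).collect().toList, so the aggregation happens through a transformation on the RDD rather than through shared mutable state.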
You can learn more from the Spark programming guide's section on understanding closures.