Scala: Understanding the operation of the map function

scala apache-spark

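I came across the following example in Holden Karau's book "Fast Data Processing with Spark". I don't understand what the following lines of code do in the program: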

val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)

The full program is:

package pandaspark.examples
import spark.SparkContext
import spark.SparkContext._
import spark.SparkFiles;
import au.com.bytecode.opencsv.CSVReader
import java.io.StringReader
object LoadCsvExample {
  def main(args: Array[String]) {
    if (args.length != 2) {
      System.err.println("Usage: LoadCsvExample <master> <inputfile>")
      System.exit(1)
    }

    val master = args(0)
    val inputFile = args(1)
    val sc = new SparkContext(master, "Load CSV Example",
      System.getenv("SPARK_HOME"),
      Seq(System.getenv("JARS")))
    sc.addFile(inputFile)
    val inFile = sc.textFile(inputFile)
    val splitLines = inFile.map(line => {
      val reader = new CSVReader(new StringReader(line))
      reader.readNext()
    })
    val numericData = splitLines.map(line => line.map(_.toDouble))
    val summedData = numericData.map(row => row.sum)
    println(summedData.collect().mkString(","))
  }
}

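First of all, your code sample does not contain any flatMap operation, so the title is misleading. In general, though, map called on a collection returns a new collection in which the given function has been applied to every element of the original.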
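As a minimal sketch of that general rule, here is map on ordinary Scala lists rather than on an RDD. The object name and the sample values are made up for illustration.

```scala
object MapBasics {
  val numbers: List[Int] = List(1, 2, 3)

  // map applies the function to every element and returns a new list;
  // the original list is left unchanged
  val doubled: List[Int] = numbers.map(n => n * 2)

  // the element type may change: here String => Int
  val lengths: List[Int] = List("spark", "rdd").map(_.length)
}
```

The result always has the same number of elements as the input; only the elements themselves (and possibly their type) change.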

Going through the snippet line by line:

val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})

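The type of inFile is RDD[String]. map takes each such string, creates a CSV reader from it, and calls readNext (which returns an array of strings). So in the end you get an RDD[Array[String]] (RDD[String[]] in Java notation).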
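The same step can be sketched without Spark or opencsv. This is only an approximation under stated assumptions: a List[String] stands in for the RDD, and the hypothetical helper parseLine uses a plain split(",") in place of CSVReader.readNext(), so unlike a real CSV parser it does not handle quoted fields.

```scala
object SplitSketch {
  // Stand-in for the RDD[String] of input lines (made-up data)
  val lines: List[String] = List("1,2,3", "4,5")

  // Hypothetical stand-in for CSVReader.readNext(): a plain split,
  // with no support for quotes or escaped commas
  def parseLine(line: String): Array[String] = line.split(",")

  // One Array[String] per input line
  val splitLines: List[Array[String]] = lines.map(parseLine)
}
```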

val numericData = splitLines.map(line => line.map(_.toDouble))

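This is a slightly more complex line, with two nested map operations. Again you take each element of the RDD (now an Array[String]) and apply the _.toDouble function to every element of that array. In the end you get an RDD[Array[Double]].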
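To make the two levels explicit, the sketch below runs the nested map on a local list (an assumption standing in for the RDD, with made-up values) and writes the same thing as a nested for comprehension. The outer map walks the rows; the inner map walks the fields of one row.

```scala
object NestedMapSketch {
  // Stand-in for splitLines: one Array[String] per row (made-up data)
  val splitLines: List[Array[String]] =
    List(Array("1", "2", "3"), Array("4", "5"))

  // Outer map: over rows. Inner map: over the fields of one row.
  val numericData: List[Array[Double]] =
    splitLines.map(line => line.map(_.toDouble))

  // The same computation written as nested for comprehensions
  val viaFor: List[Array[Double]] =
    for (line <- splitLines) yield for (field <- line) yield field.toDouble
}
```

Note that toDouble throws a NumberFormatException if a field is not a valid number, so this step assumes clean numeric input.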

val summedData = numericData.map(row => row.sum)

Here you take each RDD element and apply the sum function to it. Since each element is an Array[Double], sum produces a single Double value. In the end you get an RDD[Double].
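Putting the three steps together, the type changes can be followed in one chain. This is a sketch on plain Scala collections, not the book's Spark code: rowSums is a hypothetical helper, split(",") again stands in for the CSV reader, and the input rows are invented.

```scala
object RowSumSketch {
  // String lines -> arrays of fields -> arrays of numbers -> one sum per line
  def rowSums(lines: Seq[String]): Seq[Double] =
    lines
      .map(_.split(","))            // Seq[Array[String]]
      .map(_.map(_.trim.toDouble))  // Seq[Array[Double]]
      .map(_.sum)                   // Seq[Double]

  // 1.5 + 2.5 = 4.0 and 4 + 5 + 6 = 15.0
  val result: Seq[Double] = rowSums(Seq("1.5,2.5", "4,5,6"))
}
```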

val splitLines = inFile.map(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.map(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)

So this code basically reads the data from a CSV file and adds up the values in each row. Suppose your CSV file looks like this:

10,12,13
1,2,3,4
1,2

Here we read the data from the CSV file with:

val inFile = sc.textFile("your CSV file path")

Here inFile is an RDD that holds the data as lines of text. If you call collect on it, you get something like:

Array[String] = Array(10,12,13 , 1,2,3,4 , 1,2)

When you call map on it, the function receives one line at a time:

line = 10,12,13
line = 1,2,3,4
line = 1,2

To parse each line as CSV, it uses:

val reader = new CSVReader(new StringReader(line))
reader.readNext()

So after the lines have been parsed as CSV, splitLines looks like this:

Array(
Array(10,12,13), 
Array(1,2,3,4), 
Array(1,2)
)

On splitLines, it then applies:

splitLines.map(line => line.map(_.toDouble))

Here line is, for example, Array(10,12,13), and to that array it applies:

line.map(_.toDouble)

This changes the type of every element from String to Double. So numericData holds the same values:

Array(Array(10.0, 12.0, 13.0), Array(1.0, 2.0, 3.0, 4.0), Array(1.0, 2.0))

but now every element is a Double.

Finally, it takes the sum of each individual row (each inner array), so the result is:

Array(35.0, 10.0, 3.0)

You get this result when you call summedData.collect().

Can you tell me how the output would change if I used flatMap instead? For example:

val splitLines = inFile.flatMap(line => {
  val reader = new CSVReader(new StringReader(line))
  reader.readNext()
})
val numericData = splitLines.flatMap(line => line.map(_.toDouble))
val summedData = numericData.map(row => row.sum)

flatMap always removes the inner wrapper. With map we get a result like Array(Array(10.0, 12.0, 13.0), Array(1.0, 2.0, 3.0, 4.0), Array(1.0, 2.0)), but if you apply flatMap instead of map, the result looks like Array(10.0, 12.0, 13.0, 1.0, 2.0, 3.0, 4.0, 1.0, 2.0). As you can see, flatMap has removed all the inner wrappers.
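The difference can be checked on local collections. The sketch below is an assumption-laden stand-in for the RDD code: nested Lists replace the RDD of arrays, and the rows are the three sample rows from the answer. It applies flatMap only to the numeric conversion step.

```scala
object FlatMapSketch {
  // Stand-in for splitLines, using the sample rows from the answer
  val splitLines: List[List[String]] =
    List(List("10", "12", "13"), List("1", "2", "3", "4"), List("1", "2"))

  // map keeps one inner collection per row
  val nested: List[List[Double]] = splitLines.map(line => line.map(_.toDouble))

  // flatMap concatenates the inner collections into one flat list
  val flat: List[Double] = splitLines.flatMap(line => line.map(_.toDouble))

  // The elements of flat are plain Doubles, so flat.map(row => row.sum)
  // would not compile; only a sum over the whole list is possible
  val total: Double = flat.sum

  // If the rows themselves are flattened first, each element is a String,
  // and map on a String walks its characters: '1' -> 49.0, '0' -> 48.0
  val charCodes: List[Double] = "10".map(_.toDouble).toList
}
```

Two consequences follow for the snippet in the comment. After flattening there are no rows left, so the per-row sums (35.0, 10.0, 3.0) can no longer be computed and the final map(row => row.sum) has nothing to call sum on. And if the first map is also replaced by flatMap, the second stage would presumably see individual strings and convert their characters to character codes rather than parsing numbers.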