
Building an inverted index in a Spark application using Scala

Tags: scala, csv, apache-spark

I am new to Spark and to the Scala programming language. My input is a CSV file, and I need to build an inverted index over the values in the file, as described in the example below.

Input: file.csv

attr1, attr2, attr3
1,     AAA,    23
2,     BBB,    23
3,     AAA,    27

output format: value -> (rowid, columnid) pairs
for example: AAA -> ((1,2),(3,2))
             27  -> (3,3)
I started with the following code, but then I got stuck. Please help.

import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("Invert Me!").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val txtFilePath = "/home/person/Desktop/sample.csv"

    val txtFile = sc.textFile(txtFilePath)
    val nRows = txtFile.count()

    // Split each line on commas and trim the resulting cells
    val data = txtFile.map(line => line.split(",").map(elem => elem.trim()))
    val nCols = data.collect()(0).length

  }
}

Code that keeps your style would look like:

val header = sc.broadcast(data.first())

// zipWithIndex pairs each row with its index; index 0 is the header row,
// so filter it out, then emit value -> (columnName, rowIndex) for each cell
val cells = data.zipWithIndex().filter(_._2 > 0).flatMap { case (row, index) =>
  row.zip(header.value).map { case (value, column) => value -> (column, index) }
}

val index: RDD[(String, Vector[(String, Long)])] =
  cells.aggregateByKey(Vector.empty[(String, Long)])(_ :+ _, _ ++ _)
Here the index value maps each cell value to its pairs of (ColumnName, RowIndex).
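For intuition, the grouping that aggregateByKey performs can be sketched without Spark, using plain Scala collections and a hypothetical sample of the cell pairs from the CSV above (groupBy plays the role of aggregateByKey on a local Seq):

```scala
// Hypothetical (value -> (columnName, rowIndex)) pairs, mirroring the CSV above
val cells = Seq(
  "AAA" -> ("attr2", 1L),
  "BBB" -> ("attr2", 2L),
  "AAA" -> ("attr2", 3L),
  "23"  -> ("attr3", 1L))

// Group pairs by cell value; each value maps to all its (column, row) locations
val index: Map[String, Vector[(String, Long)]] =
  cells.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).toVector }

println(index("AAA")) // Vector((attr2,1), (attr2,3))
```

On an RDD the same result needs aggregateByKey (or groupByKey) because the pairs are distributed across partitions and must be merged per key.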

The underscores in the approach above are just shorthand lambdas; the same thing can be written more explicitly as

val cellsVerbose = data.zipWithIndex().flatMap {
  case (row, 0) => IndexedSeq.empty // skipping the header row (zipWithIndex starts at 0)
  case (row, index) => row.zip(header.value).map {
    case (value, column) => value -> (column, index)
  }
}


val indexVerbose: RDD[(String, Vector[(String, Long)])] =
  cellsVerbose.aggregateByKey(zeroValue = Vector.empty[(String, Long)])(
    seqOp = (keys, key) => keys :+ key,
    combOp = (keysA, keysB) => keysA ++ keysB)

Since I am new to Scala, it is taking me some time to grasp all the underscores in the code above. I'll get back to you if I need further help. Thanks! (@CRM provided the more verbose version.)
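Putting it together: the question asked for numeric (rowid, columnid) pairs rather than column names. A minimal end-to-end sketch, using plain Scala collections in place of an RDD so it runs without a cluster (the same map/filter/flatMap chain applies to sc.textFile(path) on Spark; the inline sample lines are assumed from the question):

```scala
// Sample CSV lines from the question (header first)
val lines = Seq(
  "attr1, attr2, attr3",
  "1,     AAA,    23",
  "2,     BBB,    23",
  "3,     AAA,    27")

val data = lines.map(_.split(",").map(_.trim))

// zipWithIndex numbers rows from 0, so the header has index 0 and the first
// data row conveniently has rowid 1; columns are numbered from 1 to match
// the question's expected output
val cells = data.zipWithIndex.drop(1).flatMap { case (row, rowIdx) =>
  row.zipWithIndex.map { case (value, colIdx) =>
    value -> (rowIdx.toLong, (colIdx + 1).toLong)
  }
}

// Invert: each value maps to all of its (rowid, columnid) locations
val inverted: Map[String, Vector[(Long, Long)]] =
  cells.groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2).toVector }

println(inverted("AAA")) // Vector((1,2), (3,2))
println(inverted("27"))  // Vector((3,3))
```

On Spark, replace the final groupBy with aggregateByKey as shown above; the per-cell numbering logic stays the same.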