Convert an array to a DataFrame with columns and index in Scala
Initially I have a matrix:
0.0 0.4 0.4 0.0
0.1 0.0 0.0 0.7
0.0 0.2 0.0 0.3
0.3 0.0 0.0 0.0
which is a `Matrix`. I convert it to a flat array via
`val normal_array = matrix.toArray`
I also have an array of strings:
inputCols : Array[String] = Array(p1, p2, p3, p4)
I need to convert this matrix into the DataFrame shown below. (Note: the number of rows and columns in the matrix will be the same as the length of inputCols.)
In Python this is easily done with the pandas library:
arrayToDataframe = pandas.DataFrame(normal_array,columns = inputCols, index = inputCols)
But how can I do this in Scala?

You can do it like below:
// Convert your data to a Scala Seq/List/Array
val list = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
// Define your array of desired column names
val inputCols: Array[String] = Array("p1", "p2", "p3", "p4")
// Imports for the column functions used below
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, udf}
// Create a DataFrame from the data; it gets default column names like _1, _2, etc.
val df = sparkSession.createDataFrame(list)
// Get the generated column names from the DataFrame
val dfColumns = df.columns
// Build "old as new" expressions renaming each generated column to the desired name
val query = inputCols.zipWithIndex.map { case (newName, i) => s"${dfColumns(i)} as $newName" }
// Run the renaming query
val newDf = df.selectExpr(query: _*)
// UDF that maps a row number (0,1,2,3) to the corresponding name in inputCols
// (monotonically_increasing_id yields a Long, so the UDF takes a Long)
val getIndexUDF = udf((row_no: Long) => inputCols(row_no.toInt))
// Add a temporary row_no column holding the row id, derive the index column, then drop row_no
val dfWithRow = newDf
  .withColumn("row_no", monotonically_increasing_id())
  .withColumn("index", getIndexUDF(col("row_no")))
  .drop("row_no")
dfWithRow.show
Sample output:
+---+---+---+---+-----+
| p1| p2| p3| p4|index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|   p1|
|0.1|0.0|0.0|0.7|   p2|
|0.0|0.2|0.0|0.3|   p3|
|0.3|0.0|0.0|0.0|   p4|
+---+---+---+---+-----+
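As a side note (not part of the original answer), the selectExpr rename step can be replaced by a single positional rename; a minimal equivalent, assuming the same df and inputCols as above:

// toDF replaces all column names positionally, no expression strings needed
val newDf = df.toDF(inputCols: _*)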
Another way:
import spark.implicits._ // assuming an existing SparkSession named spark; needed for toDF
val data = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
val cols = Array("p1", "p2", "p3", "p4", "index")
Zip the two collections and convert the result to a DataFrame. Since zip truncates to the shorter collection, each of the four rows is paired with p1 through p4 as its index value, and the fifth entry of cols, "index", is only used as a column name:
data.zip(cols).map {
  case (row, label) => (row._1, row._2, row._3, row._4, label)
}.toDF(cols: _*)
Output:
+---+---+---+---+-----+
|p1 |p2 |p3 |p4 |index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|p1   |
|0.1|0.0|0.0|0.7|p2   |
|0.0|0.2|0.0|0.3|p3   |
|0.3|0.0|0.0|0.0|p4   |
+---+---+---+---+-----+
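Both answers so far hardcode the Seq of tuples rather than starting from the normal_array in the question, so here is a minimal sketch of that bridge. It assumes the flat array is laid out row-major; note that Spark's ml.linalg Matrix.toArray is column-major, so you may need matrix.transpose.toArray instead. sparkSession is assumed to be an existing SparkSession.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Flat array as produced by matrix.toArray, assumed row-major here
val normal_array = Array(0.0,0.4,0.4,0.0, 0.1,0.0,0.0,0.7, 0.0,0.2,0.0,0.3, 0.3,0.0,0.0,0.0)
val inputCols = Array("p1", "p2", "p3", "p4")
val n = inputCols.length

// One Row per matrix row, with the matching label appended as the index value
val rows = normal_array.grouped(n).toSeq.zip(inputCols).map {
  case (values, label) => Row.fromSeq(values :+ label)
}
// Schema: one double column per label, plus the string index column
val schema = StructType(inputCols.map(StructField(_, DoubleType)) :+ StructField("index", StringType))
val df = sparkSession.createDataFrame(sparkSession.sparkContext.parallelize(rows), schema)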
A newer and shorter version, for Spark versions > 2.4.5. See the inline comments explaining each statement:
import org.apache.spark.sql.{SparkSession, functions}

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val cols = (1 to 4).map(i => s"p$i")
val listDf = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
  .toDF(cols: _*) // Map the data to the new column names
  .withColumn("index", // Build the index label from an auto-increasing id; +1 so labels start at p1, not p0
    functions.concat(functions.lit("p"), functions.monotonically_increasing_id() + 1))
listDf.show()
Comments:
Can you provide a sample output for the requirement you explained in the comments below?
Yes, I have edited my question with an example. Doing this in Python is very simple.
I have provided my solution in Scala.
I like this monotonically_increasing_id approach: the generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive.
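To make that caveat concrete, here is a minimal sketch (a hypothetical local session, not part of the original answers) showing that the ids jump between partitions: the current Spark implementation puts the partition ID in the upper 31 bits and a per-partition counter in the lower 33 bits.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Spread 8 rows over 4 partitions and attach a monotonically increasing id.
// Within a partition the ids are consecutive, but each partition starts at
// (partitionId << 33), so the overall ids are unique and increasing, not 0,1,2,...
val demo = spark.range(0, 8).repartition(4)
  .withColumn("mono_id", monotonically_increasing_id())
demo.show(false)

This is also why the UDF lookup inputCols(row_no) in the first answer is only safe for small data that fits in a single partition, where the ids happen to be 0..n-1.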