Convert an array to a DataFrame with columns and index in Scala
Initially I have a matrix:
0.0 0.4 0.4 0.0
0.1 0.0 0.0 0.7
0.0 0.2 0.0 0.3
0.3 0.0 0.0 0.0
which is a `Matrix`. I convert it to a flat array via
`val normal_array = matrix.toArray`
I also have an array of strings:
inputCols : Array[String] = Array(p1, p2, p3, p4)
I need to convert this matrix into the DataFrame shown below. (Note: the number of rows and columns in the matrix will be the same as the length of inputCols.)
In Python this is easily done with the pandas library:
arrayToDataframe = pandas.DataFrame(normal_array,columns = inputCols, index = inputCols)
But how can I do this in Scala?

You can do it like below:
// Convert your data to a Scala Seq/List/Array
val list = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
// Define your array of desired column names
val inputCols: Array[String] = Array("p1", "p2", "p3", "p4")
// Imports for the column functions used below
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, udf}
// Create a DataFrame from the data; it gets default column names like _1, _2, etc.
val df = sparkSession.createDataFrame(list)
// Get the generated column names from the DataFrame
val dfColumns = df.columns
// Build "old as new" expressions renaming each generated column to the desired name
val query = inputCols.zipWithIndex.map { case (newName, i) => s"${dfColumns(i)} as $newName" }
// Run the renaming query
val newDf = df.selectExpr(query: _*)
// UDF that maps a row number (0,1,2,3) to the corresponding name in inputCols
// (monotonically_increasing_id yields a Long, so the UDF takes a Long)
val getIndexUDF = udf((row_no: Long) => inputCols(row_no.toInt))
// Add a temporary row_no column holding the row id, derive the index column, then drop row_no
val dfWithRow = newDf
  .withColumn("row_no", monotonically_increasing_id())
  .withColumn("index", getIndexUDF(col("row_no")))
  .drop("row_no")
dfWithRow.show
Sample output:
+---+---+---+---+-----+
| p1| p2| p3| p4|index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|   p1|
|0.1|0.0|0.0|0.7|   p2|
|0.0|0.2|0.0|0.3|   p3|
|0.3|0.0|0.0|0.0|   p4|
+---+---+---+---+-----+
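As a side note (not part of the original answer), the selectExpr rename step can be replaced by a single positional rename; a minimal equivalent, assuming the same df and inputCols as above:

// toDF replaces all column names positionally, no expression strings needed
val newDf = df.toDF(inputCols: _*)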
Another way:
import spark.implicits._ // assuming an existing SparkSession named spark; needed for toDF
val data = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
val cols = Array("p1", "p2", "p3", "p4", "index")
Zip the two collections and convert the result to a DataFrame. Since zip truncates to the shorter collection, each of the four rows is paired with p1 through p4 as its index value, and the fifth entry of cols, "index", is only used as a column name:
data.zip(cols).map {
  case (row, label) => (row._1, row._2, row._3, row._4, label)
}.toDF(cols: _*)
Output:
+---+---+---+---+-----+
|p1 |p2 |p3 |p4 |index|
+---+---+---+---+-----+
|0.0|0.4|0.4|0.0|p1   |
|0.1|0.0|0.0|0.7|p2   |
|0.0|0.2|0.0|0.3|p3   |
|0.3|0.0|0.0|0.0|p4   |
+---+---+---+---+-----+
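Both answers so far hardcode the Seq of tuples rather than starting from the normal_array in the question, so here is a minimal sketch of that bridge. It assumes the flat array is laid out row-major; note that Spark's ml.linalg Matrix.toArray is column-major, so you may need matrix.transpose.toArray instead. sparkSession is assumed to be an existing SparkSession.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Flat array as produced by matrix.toArray, assumed row-major here
val normal_array = Array(0.0,0.4,0.4,0.0, 0.1,0.0,0.0,0.7, 0.0,0.2,0.0,0.3, 0.3,0.0,0.0,0.0)
val inputCols = Array("p1", "p2", "p3", "p4")
val n = inputCols.length

// One Row per matrix row, with the matching label appended as the index value
val rows = normal_array.grouped(n).toSeq.zip(inputCols).map {
  case (values, label) => Row.fromSeq(values :+ label)
}
// Schema: one double column per label, plus the string index column
val schema = StructType(inputCols.map(StructField(_, DoubleType)) :+ StructField("index", StringType))
val df = sparkSession.createDataFrame(sparkSession.sparkContext.parallelize(rows), schema)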
A newer and shorter version, for Spark versions > 2.4.5. See the inline comments explaining each statement:
import org.apache.spark.sql.{SparkSession, functions}

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val cols = (1 to 4).map(i => s"p$i")
val listDf = Seq((0.0,0.4,0.4,0.0),(0.1,0.0,0.0,0.7),(0.0,0.2,0.0,0.3),(0.3,0.0,0.0,0.0))
  .toDF(cols: _*) // Map the data to the new column names
  .withColumn("index", // Build the index label from an auto-increasing id; +1 so labels start at p1, not p0
    functions.concat(functions.lit("p"), functions.monotonically_increasing_id() + 1))
listDf.show()
Comments:
Can you provide a sample output for the requirement you explained in the comments below?
Yes, I have edited my question with an example. Doing this in Python is very simple.
I have provided my solution in Scala.
I like this monotonically_increasing_id approach: the generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive.
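To make that caveat concrete, here is a minimal sketch (a hypothetical local session, not part of the original answers) showing that the ids jump between partitions: the current Spark implementation puts the partition ID in the upper 31 bits and a per-partition counter in the lower 33 bits.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Spread 8 rows over 4 partitions and attach a monotonically increasing id.
// Within a partition the ids are consecutive, but each partition starts at
// (partitionId << 33), so the overall ids are unique and increasing, not 0,1,2,...
val demo = spark.range(0, 8).repartition(4)
  .withColumn("mono_id", monotonically_increasing_id())
demo.show(false)

This is also why the UDF lookup inputCols(row_no) in the first answer is only safe for small data that fits in a single partition, where the ids happen to be 0..n-1.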