Extracting columns from a nested Spark DataFrame as Scala arrays
I have a DataFrame myDf that contains an array of point pairs (i.e. x and y coordinates). It has the following schema:
myDf.printSchema
root
|-- pts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: float (nullable = true)
| | |-- y: float (nullable = true)
I would like to get x and y as individual plain Scala arrays. I suppose I need to apply the explode function, but I can't figure out how. I tried to apply this solution, but I couldn't get it to work.

I am using Spark 1.6.1 and Scala 2.10.
Edit: I realized I had misunderstood how Spark works; you can only get the actual arrays by collecting the data (or by using a UDF).

Assuming myDf is a DataFrame read from a json file:
{
  "pts": [
    {
      "x": 0.0,
      "y": 0.1
    },
    {
      "x": 1.0,
      "y": 1.1
    },
    {
      "x": 2.0,
      "y": 2.1
    }
  ]
}
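For reference, here is a minimal sketch of reading such a file in Spark 1.6 (the file name points.json is just an example, not from the original post):

import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext (predefined in the spark-shell).
// Note: Spark's json reader expects one JSON object per line, so the
// pretty-printed object above would need to sit on a single line.
val sqlContext = new SQLContext(sc)
val myDf = sqlContext.read.json("points.json")
myDf.printSchema() // pts: array<struct<x:double, y:double>>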
You can explode it like this:
Java:
DataFrame pts = myDf.select(org.apache.spark.sql.functions.explode(myDf.col("pts")).as("pts"))
                    .select("pts.x", "pts.y");
pts.printSchema();
pts.show();
Scala:
// Sorry, I don't know Scala; I just translated the Java code above,
// so there may be some mistakes.
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // for the $"..." column syntax

val pts = myDf.select(explode($"pts").as("pts"))
              .select($"pts.x", $"pts.y")
pts.printSchema()
pts.show()
Here is the printed schema (note that reading from json infers x and y as double, while the question's original schema declared float):
root
|-- x: double (nullable = true)
|-- y: double (nullable = true)
And here is the result of pts.show():
+---+---+
| x| y|
+---+---+
|0.0|0.1|
|1.0|1.1|
|2.0|2.1|
+---+---+
There are two ways to get the points as plain Scala arrays.

Collect them to the driver:
val localRows = pts.take(10) // rows of the exploded DataFrame from above
val xs: Array[Double] = localRows.map(_.getAs[Double]("x"))
val ys: Array[Double] = localRows.map(_.getAs[Double]("y"))
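If you need all of the points rather than just the first 10, a sketch (assuming the collected data fits in driver memory):

// collect() pulls the entire exploded DataFrame to the driver
val allRows = pts.collect()
val allXs: Array[Double] = allRows.map(_.getAs[Double]("x"))
val allYs: Array[Double] = allRows.map(_.getAs[Double]("y"))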
Or inside a UDF:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.{functions, Row}

val processArr = functions.udf((pts: WrappedArray[Row]) => {
  // getAs[Float] matches the question's schema; use Double for json-read data
  val xs: Array[Float] = pts.map(_.getAs[Float]("x")).toArray
  val ys: Array[Float] = pts.map(_.getAs[Float]("y")).toArray
  // ...do something with xs and ys, returning a type Spark supports, e.g.:
  xs
})
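A short sketch (not from the original answer) of applying the UDF; the result column name xs is made up, and per the comment above you would switch to getAs[Double] if the data came from json:

import sqlContext.implicits._ // assuming a SQLContext named sqlContext

// apply the UDF to the nested pts array column
myDf.select(processArr($"pts").as("xs")).show()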
Thanks to both the asker and the answerer, you made my day. I was thrilled using spark-xml, and your solution rocks ;-)