Spark Scala union fails although both DataFrames have the same schema
On Windows with Spark 2.3.1 I am trying to union two DataFrames. Although both have the same schema, I get an error saying "Union can only be performed on tables with the compatible column types", and I don't understand why, since I already cast the columns of the second DataFrame to obtain the desired schema. My code:
import breeze.linalg._
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{rand => random, udf, col}
import org.apache.spark.sql.types._

object MahalanobisDeneme {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KMeansZScore")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = spark.range(0, 10).select("id")
      .withColumn("uniform", random(10L))
      .withColumn("normal1", random(10L))
      .withColumn("normal2", random(11L))
    //df.show()

    val assembler = new VectorAssembler()
      .setInputCols(Array("uniform", "normal1", "normal2"))
      .setOutputCol("features")
    val assembledDF = assembler.transform(df)
    //assembledDF.show()

    val idFeaturesDF = assembledDF.select("id", "features")
    idFeaturesDF.show(false)
    idFeaturesDF.printSchema()

    val outlierDF = spark.createDataFrame(Seq((10, Vectors.dense(5, 5, 5))))
    val outlierDF2 = outlierDF
      .withColumn("id", outlierDF.col("_1").cast("Long"))
      .withColumn("features", outlierDF.col("_2"))
      .select("id", "features")
    outlierDF2.show()
    outlierDF2.printSchema()

    val unionedDF = idFeaturesDF.union(outlierDF2)
    unionedDF.show()
  }
}
Schema output of idFeaturesDF:
root
|-- id: long (nullable = false)
|-- features: vector (nullable = true)
Schema output of outlierDF2:
root
|-- id: long (nullable = false)
|-- features: vector (nullable = true)
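Note that printSchema() renders both the ml and the mllib vector UDTs simply as "vector", so the schemas above cannot reveal the mismatch. A minimal sketch (assuming Spark 2.3-era behavior; the object name UdtCheck is made up) that surfaces the difference by printing the underlying DataType classes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors

object UdtCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UdtCheck").master("local[*]").getOrCreate()
    import spark.implicits._

    // A "features" column produced by VectorAssembler (ml package)...
    val mlDF = new VectorAssembler()
      .setInputCols(Array("x"))
      .setOutputCol("features")
      .transform(Seq(1.0).toDF("x"))

    // ...and one built from mllib vectors, as in the question.
    val mllibDF = Seq((0L, Vectors.dense(1.0))).toDF("id", "features")

    // Both print as "vector" in printSchema(), but the DataType classes differ
    // (ml.linalg.VectorUDT vs mllib.linalg.VectorUDT).
    val mlType = mlDF.schema("features").dataType.getClass.getName
    val mllibType = mllibDF.schema("features").dataType.getClass.getName
    println(mlType)
    println(mllibType)
    assert(mlType != mllibType)

    spark.stop()
  }
}
```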
The fuller error log:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types.
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <>
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> at the second column of the second table;;
'Union
:- AnalysisBarrier
: +- Project [id#0L, features#15]
: +- Project [id#0L, uniform#3, normal1#6, normal2#10, UDF(named_struct(uniform, uniform#3, normal1, normal1#6, normal2, normal2#10)) AS features#15]
: +- Project [id#0L, uniform#3, normal1#6, rand(11) AS normal2#10]
: +- Project [id#0L, uniform#3, rand(10) AS normal1#6]
: +- Project [id#0L, rand(10) AS uniform#3]
: +- Project [id#0L]
: +- Range (0, 10, step=1, splits=Some(8))
+- AnalysisBarrier
+- Project [id#35L, features#39]
+- Project [_1#31, _2#32, id#35L, _2#32 AS features#39]
+- Project [_1#31, _2#32, cast(_1#31 as bigint) AS id#35L]
+- LocalRelation [_1#31, _2#32]
Try changing

import org.apache.spark.mllib.linalg.{Vector, Vectors}

to

import org.apache.spark.ml.linalg.{Vector, Vectors}

The two look the same in printSchema(), but they are different types. (I compared idFeaturesDF.head.get(1).getClass and outlierDF2.head.get(1).getClass, since the message complains about the second column.)

There could be three issues here: a nullable = true/false mismatch inside the struct-type column, a column-name mismatch, or a Spark bug when unioning complex data types. The first two can be checked by calling .printSchema() on both DataFrames. -- I added the printSchema() output above; the schemas are identical.