Apache Spark: scala.collection.mutable.WrappedArray$ofRef cannot be cast to Integer

Tags: apache-spark, apache-spark-sql, spark-dataframe

I am new to Spark and Scala. I am trying to call a function as a Spark UDF, but I run into an error that I cannot seem to resolve.

I understand that in Scala, Array and Seq are not the same thing: WrappedArray is a subtype of Seq, and there are implicit conversions between WrappedArray and Array. What I am not sure about is why that conversion does not happen in the case of the UDF.
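
As a plain-Scala illustration of that Array/Seq distinction (my own sketch, assuming Scala 2.10/2.11 as shipped with Spark 1.x, not part of the job below): the WrappedArray/Array conversions are implicits that the compiler inserts, so a runtime cast from the WrappedArray that Spark passes in to Array[Int] finds nothing to convert and fails.

import scala.collection.mutable.WrappedArray

// An Array is implicitly wrapped into a WrappedArray wherever a Seq is expected;
// scalac inserts this conversion at compile time.
val wa: Seq[Int] = Array(1, 2, 3)
println(wa.isInstanceOf[WrappedArray[_]])   // true -- the runtime object is a WrappedArray

// An explicit conversion works, because toArray copies the elements out.
val copied: Array[Int] = wa.toArray

// A runtime cast does not, because no implicit applies at runtime; this is
// essentially the ClassCastException the UDF below runs into (Spark reports
// WrappedArray$ofRef because it passes boxed elements).
// wa.asInstanceOf[Array[Int]]   // java.lang.ClassCastException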

Any advice to help me understand and resolve this would be much appreciated.

Here is a snippet of the code:

def filterMapKeysWithSet(m: Map[Int, Int], a: Array[Int]): Map[Int, Int] = {
  val seqToArray = a.toArray
  val s = seqToArray.toSet
  m filterKeys s
}

val myUDF = udf((m: Map[Int, Int], a: Array[Int]) => filterMapKeysWithSet(m, a))

case class myType(id: Int, m: Map[Int, Int])
val mapRDD = Seq(myType(1, Map(1 -> 100, 2 -> 200)), myType(2, Map(1 -> 100, 2 -> 200)), myType(3, Map(3 -> 300, 4 -> 400)))
val mapDF = mapRDD.toDF

mapDF: org.apache.spark.sql.DataFrame = [id: int, m: map<int,int>]
root
 |-- id: integer (nullable = false)
 |-- m: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = false)

case class myType2(id: Int, a: Array[Int])
val idRDD = Seq(myType2(1, Array(1,2,100,200)), myType2(2, Array(100,200)), myType2(3, Array(1,2)) )
val idDF = idRDD.toDF

idDF: org.apache.spark.sql.DataFrame = [id: int, a: array<int>]
root
 |-- id: integer (nullable = false)
 |-- a: array (nullable = true)
 |    |-- element: integer (containsNull = false)

import sqlContext.implicits._
/* Hive context is exposed as sqlContext */

val j = mapDF.join(idDF, idDF("id") === mapDF("id")).drop(idDF("id"))
val k = j.withColumn("filteredMap",myUDF(j("m"), j("a")))
k.show

Looking at the dataframes 'j' and 'k', the map and array columns have the correct datatypes:

j: org.apache.spark.sql.DataFrame = [id: int, m: map<int,int>, a: array<int>]
root
 |-- id: integer (nullable = false)
 |-- m: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = false)
 |-- a: array (nullable = true)
 |    |-- element: integer (containsNull = false)

k: org.apache.spark.sql.DataFrame = [id: int, m: map<int,int>, a: array<int>, filteredMap: map<int,int>]
root
 |-- id: integer (nullable = false)
 |-- m: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = false)
 |-- a: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- filteredMap: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = false)

However, an action on the dataframe 'k', which calls the UDF, fails with the following error -

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 6, ip-100-74-42-194.ec2.internal): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:60)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
    at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1865)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1865)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Changing the datatype from Array[Int] to Seq[Int] in the function filterMapKeysWithSet seems to resolve the issue described above:

def filterMapKeysWithSet(m: Map[Int, Int], a: Seq[Int]): Map[Int, Int] = {
  val seqToArray = a.toArray
  val s = seqToArray.toSet
  m filterKeys s
}

val myUDF = udf((m: Map[Int, Int], a: Seq[Int]) => filterMapKeysWithSet(m, a))

k: org.apache.spark.sql.DataFrame = [id: int, m: map<int,int>, a: array<int>, filteredMap: map<int,int>]
root
 |-- id: integer (nullable = false)
 |-- m: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = false)
 |-- a: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- filteredMap: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = false)

+---+--------------------+----------------+--------------------+
| id|                   m|               a|         filteredMap|
+---+--------------------+----------------+--------------------+
|  1|Map(1 -> 100, 2 -...|[1, 2, 100, 200]|Map(1 -> 100, 2 -...|
|  2|Map(1 -> 100, 2 -...|      [100, 200]|               Map()|
|  3|Map(3 -> 300, 4 -...|          [1, 2]|               Map()|
+---+--------------------+----------------+--------------------+
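
As a side note (my own simplification, not from the original fix): once the parameter is typed as Seq[Int], the intermediate toArray copy is unnecessary, since the key set can be built straight from the Seq that Spark supplies.

import org.apache.spark.sql.functions.udf

// Equivalent to the working filterMapKeysWithSet, minus the intermediate Array copy.
// A Set[Int] is also an (Int => Boolean) predicate, so it can be passed to filterKeys.
val myUDF = udf((m: Map[Int, Int], a: Seq[Int]) => m.filterKeys(a.toSet))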