Defining a StructType as the input data type of a function in Spark Scala 2.11


I am trying to write a Spark UDF in Scala, and I need to define the input data type of the function.

I have a StructType schema variable as shown below:

import org.apache.spark.sql.types._

val relationsSchema = StructType(
  Seq(
    StructField("relation", ArrayType(
      StructType(
        Seq(
          StructField("attribute", StringType, true),
          StructField("email", StringType, true),
          StructField("fname", StringType, true),
          StructField("lname", StringType, true)
        )
      ), true
    ), true)
  )
)
I am trying to write a function like the following:

val relationsFunc: Array[Map[String,String]] => Array[String] = _.map(do something)
val relationUDF = udf(relationsFunc)

input.withColumn("relation",relationUDF(col("relation")))
The above code throws the following exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(relation)' due to data type mismatch: argument 1 requires array<map<string,string>> type, however, '`relation`' is of array<struct<attribute:string,email:string,fname:string,lname:string>> type.;;
'Project [relation#89, UDF(relation#89) AS proc#273]

The struct under relation is a Row, so the function should have the following signature:

val relationsFunc: Array[Row] => Array[String]
You can then access the data by position or by name, for example:

{r:Row => r.getAs[String]("email")}
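For illustration, here is a small sketch of my own (not from the original answer) showing both access styles on a Row that follows the element schema above, where field index 1 is email:

import org.apache.spark.sql.Row

// By name: look the field up through the schema attached to the Row
val emailByName: Row => String = r => r.getAs[String]("email")

// By position: index into the Row directly (index 1 is "email" in the schema above)
val emailByPosition: Row => String = r => r.getString(1)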

Check the mapping table in the Spark SQL documentation (the Data Types section) to see how Spark SQL data types are represented in Scala.

Your relation field is an array of the Spark SQL complex type StructType, which is represented by the Scala type org.apache.spark.sql.Row, so Row is the input type you should use.
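As a quick summary of the part of that table that matters here (my own sketch; the helper name is made up), the Scala types a UDF actually receives are:

import org.apache.spark.sql.Row

// StringType        -> String
// StructType(...)   -> org.apache.spark.sql.Row
// ArrayType(struct) -> Seq[Row] (a WrappedArray at runtime), not Array[Row]
val extractEmails: Seq[Row] => Seq[String] = _.map(_.getAs[String]("email"))

The Array-versus-Seq distinction is exactly what the closing comment below runs into.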

I created a complete working example from your code that extracts the email value:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val relationsSchema = StructType(
  Seq(
    StructField("relation", ArrayType(
      StructType(
        Seq(
          StructField("attribute", StringType, true),
          StructField("email", StringType, true),
          StructField("fname", StringType, true),
          StructField("lname", StringType, true)
        )
      ), true
    ), true)
  )
)

val data = Seq(
  // One input row whose single column is an array containing one struct
  // (attribute, email, fname, lname), matching relationsSchema above
  Row(Seq(Row("1", "johnny@example.com", "Johnny", "Appleseed")))
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  relationsSchema
)

val relationsFunc = (relation: Array[Row]) => relation.map(_.getAs[String]("email"))
val relationUdf = udf(relationsFunc)

df.withColumn("relation", relationUdf(col("relation")))

This gives me Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Lorg.apache.spark.sql.Row;, but when I used Seq instead of Array it worked! Thanks
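For completeness, here is a minimal sketch of what that fix looks like end to end; it is my reconstruction of the change described in the comment, reusing relationsSchema, df, and the spark session from the example above, with the UDF parameter declared as Seq[Row] instead of Array[Row]:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Spark hands an array<struct<...>> column to a Scala UDF as a Seq of Rows,
// so declaring the parameter as Seq[Row] avoids the ClassCastException
val relationsFunc = (relation: Seq[Row]) => relation.map(_.getAs[String]("email"))
val relationUdf = udf(relationsFunc)

df.withColumn("relation", relationUdf(col("relation"))).show(false)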