Scala: un-flattening a DataFrame into a specific structure

scala apache-spark dataframe

I have a flat DataFrame (df) with the following structure:

root
 |-- first_name: string (nullable = true)
 |-- middle_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- title: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_Date: string (nullable = true)
 |-- city: string (nullable = true)
 |-- zip_code: string (nullable = true)
 |-- state: string (nullable = true)
 |-- country: string (nullable = true)
 |-- email_name: string (nullable = true)
 |-- company: struct (nullable = true)
 |-- org_name: string (nullable = true)
 |-- company_phone: string (nullable = true)
 |-- partition_column: string (nullable = true)

I need to convert this DataFrame into a nested structure, because my downstream data will be in that format. So far I have implemented the following:

case class IndividualCompany(orgName: String,
                             companyPhone: String)

case class IndividualAddress(city: String,
                   zipCode: String,
                   state: String,
                   country: String)

case class IndividualPosition(title: String,
                              startDate: String,
                              endDate: String,
                              address: IndividualAddress,
                              emailName: String,
                              company: IndividualCompany)

case class Individual(firstName: String,
                     middleName: String,
                     lastName: String,
                     currentPosition: Seq[IndividualPosition],
                     partitionColumn: String)


import org.apache.spark.sql.functions.{col, udf}

val makeCompany = udf((orgName: String, companyPhone: String) => IndividualCompany(orgName, companyPhone))
val makeAddress = udf((city: String, zipCode: String, state: String, country: String) => IndividualAddress(city, zipCode, state, country))

val makePosition = udf((title: String, startDate: String, endDate: String, address: IndividualAddress, emailName: String, company: IndividualCompany) 
                    => List(IndividualPosition(title, startDate, endDate, address, emailName, company)))


val selectData = df.select(
      col("first_name").as("firstName"),
      col("middle_name").as("middleName"),
      col("last_name").as("lastName"),
      makePosition(col("title"),
        col("start_date"),
        col("end_Date"),
        makeAddress(col("city"),
          col("zip_code"),
          col("state"),
          col("country")),
        col("email_name"),
        makeCompany(col("org_name"),
          col("company_phone"))).as("currentPosition"),
      col("partition_column").as("partitionColumn")
    ).as[Individual]

selectData.printSchema()
selectData.show(10)

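I can see that the correct schema is generated for selectData, but the last line, where I try to fetch some actual data, fails. The error says that the user-defined function could not be executed: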

 org.apache.spark.SparkException: Failed to execute user defined function(anonfun$4: (string, string, string, struct<city:string,zipCode:string,state:string,country:string>, string, struct<orgName:string,companyPhone:string>) => array<struct<title:string,startDate:string,endDate:string,address:struct<city:string,zipCode:string,state:string,country:string>,emailName:string,company:struct<orgName:string,companyPhone:string>>>)

Is there a better way to achieve this?

I had a similar requirement. What I did was create an Aggregator that collects the elements into a List:

import org.apache.spark.sql.{Encoder, TypedColumn}
import org.apache.spark.sql.expressions.Aggregator
import scala.collection.mutable

object ListAggregator {
  private type Buffer[T] = mutable.ListBuffer[T]

  /** Returns a column that aggregates all elements of type T in a List. */
  def create[T](columnName: String)
               (implicit listEncoder: Encoder[List[T]], listBufferEncoder: Encoder[Buffer[T]]): TypedColumn[T, List[T]] =
    new Aggregator[T, Buffer[T], List[T]] {
      override def zero: Buffer[T] =
        mutable.ListBuffer.empty[T]

      override def reduce(buffer: Buffer[T], elem: T): Buffer[T] =
        buffer += elem

      override def merge(b1: Buffer[T], b2: Buffer[T]): Buffer[T] =
        if (b1.length >= b2.length) b1 ++= b2 else b2 ++= b1

      override def finish(reduction: Buffer[T]): List[T] =
        reduction.toList

      override def bufferEncoder: Encoder[Buffer[T]] =
        listBufferEncoder

      override def outputEncoder: Encoder[List[T]] =
        listEncoder
    }.toColumn.name(columnName)
}

Now you can use it like this:

import org.apache.spark.sql.SparkSession

val spark =
  SparkSession
    .builder
    .master("local[*]")
    .getOrCreate()

import spark.implicits._

final case class Flat(id: Int, name: String, age: Int)
final case class Grouped(age: Int, users: List[(Int, String)])

val data =
  List(
    (1, "Luis", 21),
    (2, "Miguel", 21),
    (3, "Sebastian", 16)
  ).toDF("id", "name", "age").as[Flat]

val grouped =
  data
    .groupByKey(flat => flat.age)
    .mapValues(flat => (flat.id, flat.name))
    .agg(ListAggregator.create(columnName = "users"))
    .map(tuple => Grouped(age = tuple._1, users = tuple._2))
// grouped: org.apache.spark.sql.Dataset[Grouped] = [age: int, users: array<struct<_1:int,_2:string>>]

grouped.show(truncate = false)
// +---+------------------------+
// |age|users                   |
// +---+------------------------+
// |16 |[[3, Sebastian]]        |
// |21 |[[1, Luis], [2, Miguel]]|
// +---+------------------------+


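The problem here is that a udf cannot take IndividualAddress and IndividualCompany directly as input. Spark represents these as structs, and the correct input type for a struct in a udf is Row. This means you need to change the declaration of makePosition to: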

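import org.apache.spark.sql.Row

// Signature only: struct arguments arrive as Row, not as case classes.
val makePosition = udf((title: String,
                        startDate: String,
                        endDate: String,
                        address: Row,
                        emailName: String,
                        company: Row)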

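Inside the udf, you now need to access the case class fields with calls such as address.getAs[String]("city"), and to use a class as a whole you have to construct it again from those fields.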
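To make the Row-based approach concrete, here is a minimal sketch, assuming it is run as a script or in spark-shell. The case classes are trimmed versions of the ones in the question, the one-row df is made-up sample data, and building the struct arguments with struct(...) is an assumption about how the udf would be called, not something the answer spells out.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

// Trimmed versions of the question's case classes (fewer fields, for brevity).
case class IndividualCompany(orgName: String, companyPhone: String)
case class IndividualAddress(city: String, zipCode: String, state: String, country: String)
case class IndividualPosition(title: String, address: IndividualAddress, company: IndividualCompany)

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data; column names follow the flat schema in the question.
val df = Seq(("Engineer", "Lima", "15001", "LIM", "PE", "Acme", "555-0100"))
  .toDF("title", "city", "zip_code", "state", "country", "org_name", "company_phone")

// Struct columns reach the udf as Row, so the case classes are rebuilt by field name.
val makePosition = udf((title: String, address: Row, company: Row) =>
  Seq(IndividualPosition(
    title,
    IndividualAddress(
      address.getAs[String]("city"),
      address.getAs[String]("zipCode"),
      address.getAs[String]("state"),
      address.getAs[String]("country")),
    IndividualCompany(
      company.getAs[String]("orgName"),
      company.getAs[String]("companyPhone")))))

// struct(...) builds the nested columns; the aliases must match the names used in getAs.
val result = df.select(
  makePosition(
    col("title"),
    struct(col("city"), col("zip_code").as("zipCode"), col("state"), col("country")),
    struct(col("org_name").as("orgName"), col("company_phone").as("companyPhone"))
  ).as("currentPosition"))

result.printSchema()
result.show(truncate = false)
```

Because the structs are built with struct(...) rather than with separate udfs, only makePosition is a udf here; makeAddress and makeCompany are not needed in this variant.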

A simpler and better option is to do everything in a single udf, like this:

val makePosition = udf((title: String, 
    startDate: String, 
    endDate: String, 
    city: String, 
    zipCode: String, 
    state: String, 
    country: String,
    emailName: String, 
    orgName: String, 
    companyPhone: String) => 
        Seq(
          IndividualPosition(
            title, 
            startDate, 
            endDate, 
            IndividualAddress(city, zipCode, state, country),
            emailName, 
            IndividualCompany(orgName, companyPhone)
          )
        )
)

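Possible duplicate.

Thanks for this solution. Doing everything in a single udf does not work for me, because a udf cannot take more than 10 parameters; that is why I chose nested udfs. For the first solution, how do I pass a Row type into this udf from the df.select() method? I have just changed the makeAddress and makeCompany methods to:

val makeCompany = udf((orgName: String, companyPhone: String) => { Row(orgName, companyPhone) }, companySchema)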
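One possible way around the 10-parameter limit, not proposed in the thread itself, is to skip the udf and build the nested column with Spark's built-in struct and array functions, then convert to the typed Dataset with .as[Individual]. The sketch below assumes the case classes from the question and a made-up one-row df; it also assumes the company fields are the flat columns org_name and company_phone, and it relies on the struct field aliases matching the case class field names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, struct}

// Case classes as defined in the question.
case class IndividualCompany(orgName: String, companyPhone: String)
case class IndividualAddress(city: String, zipCode: String, state: String, country: String)
case class IndividualPosition(title: String, startDate: String, endDate: String,
                              address: IndividualAddress, emailName: String,
                              company: IndividualCompany)
case class Individual(firstName: String, middleName: String, lastName: String,
                      currentPosition: Seq[IndividualPosition], partitionColumn: String)

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample row using the flat column names from the question.
val df = Seq(
  ("Ada", "K", "Lovelace", "Engineer", "2019-01-01", "2020-01-01",
   "London", "N1", "LDN", "UK", "ada", "Acme", "555-0100", "p1")
).toDF("first_name", "middle_name", "last_name", "title", "start_date", "end_Date",
       "city", "zip_code", "state", "country", "email_name", "org_name",
       "company_phone", "partition_column")

// Nest the flat columns with struct(...) and wrap the position in a one-element array.
val individuals = df.select(
  col("first_name").as("firstName"),
  col("middle_name").as("middleName"),
  col("last_name").as("lastName"),
  array(
    struct(
      col("title"),
      col("start_date").as("startDate"),
      col("end_Date").as("endDate"),
      struct(col("city"), col("zip_code").as("zipCode"),
             col("state"), col("country")).as("address"),
      col("email_name").as("emailName"),
      struct(col("org_name").as("orgName"),
             col("company_phone").as("companyPhone")).as("company")
    )
  ).as("currentPosition"),
  col("partition_column").as("partitionColumn")
).as[Individual]

individuals.printSchema()
individuals.show(truncate = false)
```

Since struct accepts any number of columns, this variant is not bound by the udf arity limit, and it avoids passing Row values around altogether.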
