Apache Spark: How to map a Dataset with a fairly complex schema?

apache-spark dataframe

I am working with a DataFrame that has a complex schema similar to the following:

 root
 |-- NPAData: struct (nullable = true)
 |    |-- NPADetails: struct (nullable = true)
 |    |    |-- location: string (nullable = true)
 |    |    |-- manager: string (nullable = true)
 |    |-- usersDetails: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- contacts: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- phone: string (nullable = true)
 |    |    |    |    |    |-- email: string (nullable = true)
 |    |    |    |    |    |-- address: string (nullable = true)
 |    |-- service: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- serviceName: string (nullable = true)
 |    |    |    |-- serviceCode: string (nullable = true) 
 |-- NPAHeader: struct (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- date: string (nullable = true)

I want to perform a map on each row of the DataFrame, applying a custom function, in order to meet the following requirement:

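Each row of the DataFrame has two or more elements with the structure I posted above. First, I would like to separate the elements of each row into a list of Rows, since I need to compare them. Once I have a DataFrame[List[Row]], I want to apply another map so that I can merge the elements of each list (for this I wrote a recursive function that checks the order in the list and fills the empty fields of the newer elements with the values of the older ones). I previously did all of this with RDDs, and I am now trying to do the same with the DataFrame API.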
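The merge step described above (fill the empty fields of a newer element from an older one) could be sketched in plain Scala as follows. This is not the asker's recursive function, which is not shown; it swaps in a left fold over a list that is assumed to be ordered from oldest to newest. The `Details` case class and the names `fill` and `mergeAll` are hypothetical, with field names borrowed from `NPADetails` in the schema.

```scala
object MergeSketch {
  // Hypothetical record type; None stands for an empty (null) field.
  final case class Details(location: Option[String], manager: Option[String])

  // Take each field from the newer record, falling back to the older one when it is empty.
  def fill(older: Details, newer: Details): Details =
    Details(
      location = newer.location.orElse(older.location),
      manager = newer.manager.orElse(older.manager)
    )

  // Merge a list that is assumed to be ordered from oldest to newest.
  // Returns None for an empty list.
  def mergeAll(ordered: List[Details]): Option[Details] =
    ordered.reduceLeftOption(fill)
}
```

A fold keeps the merge independent of Spark, so the same function could be called from an RDD, from a typed Dataset, or from a UDF.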

I think I need to pass an encoder.

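Since the schema is rather complex (at least I do not know how to build a StructType when there are arrays whose elements are also arrays), I tried to generate the encoder by passing the schema, as follows: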
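On building a StructType whose arrays contain structs that in turn contain arrays: the types nest like ordinary values, so each level can be defined on its own and wrapped in `ArrayType` where needed. The sketch below follows the schema printed above, assuming that `code` and `date` are direct children of `NPAHeader` and that every field is nullable; the object and value names are illustrative only.

```scala
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object NpaSchema {
  // Small helper: a nullable string field.
  private def str(name: String): StructField = StructField(name, StringType, nullable = true)

  // Innermost level: one entry of the contacts array.
  val contact: StructType = StructType(Seq(str("phone"), str("email"), str("address")))

  // One entry of usersDetails: a struct that itself holds an array of structs.
  val user: StructType = StructType(Seq(
    str("name"),
    StructField("contacts", ArrayType(contact, containsNull = true), nullable = true)
  ))

  // One entry of the service array.
  val service: StructType = StructType(Seq(str("serviceName"), str("serviceCode")))

  val npaDetails: StructType = StructType(Seq(str("location"), str("manager")))

  val npaData: StructType = StructType(Seq(
    StructField("NPADetails", npaDetails, nullable = true),
    StructField("usersDetails", ArrayType(user, containsNull = true), nullable = true),
    StructField("service", ArrayType(service, containsNull = true), nullable = true)
  ))

  // Assumption: code and date sit directly under NPAHeader.
  val npaHeader: StructType = StructType(Seq(str("code"), str("date")))

  val schema: StructType = StructType(Seq(
    StructField("NPAData", npaData, nullable = true),
    StructField("NPAHeader", npaHeader, nullable = true)
  ))
}
```

Calling `NpaSchema.schema.printTreeString()` prints the tree in the same format as `printSchema`, which makes it easy to compare against the existing DataFrame.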

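import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val sourceSchema = dfSoruce.schema
val encoder = RowEncoder(sourceSchema)

// Does not compile: the function returns java.util.List[Row], but encoder is an Encoder[Row]
dfSoruce.map(x => x.getList[Row](0))(encoder)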

But I get the following error:

type mismatch;
 found   : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
 required: org.apache.spark.sql.Encoder[java.util.List[org.apache.spark.sql.Row]]

How can I convert from ExpressionEncoder to Encoder?
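One reading of this error: `ExpressionEncoder[Row]` already is an `Encoder[Row]`, so no conversion between the two is needed. What differs is the element type, since the mapped function returns `java.util.List[Row]` while the encoder describes a single `Row`. A minimal sketch of the matching case follows, assuming a Spark version in which `RowEncoder(schema)` is still available (as in the question); the object name, the one-column schema and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object MatchingEncoderSketch {
  // Schema of the rows produced by the mapped function (illustrative).
  val outSchema: StructType = StructType(Seq(StructField("name", StringType, nullable = true)))

  def run(spark: SparkSession): Seq[Row] = {
    import spark.implicits._
    val df = Seq("ann", "bob").toDF("name")

    // The function returns a Row, so the encoder passed explicitly must be an
    // Encoder[Row] built from the schema of that output Row.
    df.map((r: Row) => Row(r.getString(0).toUpperCase))(RowEncoder(outSchema))
      .collect()
      .toSeq
  }
}
```

The point of the sketch is only that the type produced by the function and the type parameter of the encoder have to agree.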

"I want to perform a map on each row of the DataFrame applying a custom function, but for that I need to pass an encoder."

I disagree.

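map Operator (to be avoided)

Quoting the scaladoc of the map operator:

map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
Returns a new Dataset that contains the result of applying func to each element.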

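You may have noticed that the encoder (in the second parameter list) is an implicit parameter and as such does not have to be provided explicitly (that's the beauty of implicits in Scala, isn't it?).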

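My recommendation is to use func to convert to an encodable type, i.e. any type that can be used in a Dataset. You can find the available encoders for converting types to their encodable variants in the Encoders object.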
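To illustrate what an encodable result type looks like when the encoder is supplied by hand rather than implicitly, here is a small sketch built on `org.apache.spark.sql.Encoders`. The object name and the sample data are assumptions made for the example; in everyday code `import spark.implicits._` would normally provide the same encoders without any explicit argument.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

object ExplicitEncoderSketch {
  def run(spark: SparkSession): Seq[(Long, String)] = {
    // spark.range yields Dataset[java.lang.Long]; convert it to Dataset[Long].
    val ids: Dataset[Long] = spark.range(3).as[Long](Encoders.scalaLong)

    // Encoder for the tuple type returned by the mapped function.
    val enc: Encoder[(Long, String)] = Encoders.tuple(Encoders.scalaLong, Encoders.STRING)

    // Pass the encoder explicitly in map's second parameter list.
    ids.map((id: Long) => (id, "hello" * id.toInt))(enc)
      .collect()
      .sortBy(_._1)
      .toSeq
  }
}
```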

But I would rather reserve map for more advanced transformations, and only after withColumn and the standard functions fall short.

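(Recommended) withColumn Operator and Standard Functions

I would rather use the withColumn operator with the standard functions in the functions object, which gives you map-like behaviour.

Let's review your requirements and see how far we get with this approach.

"First, I would like to separate the elements of each row into a list of Rows"

To me, a list of Rows sounds like a groupBy aggregation followed by the collect_list function (possibly with some withColumn operators to extract the required values).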
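A concrete version of that idea could look like the sketch below. The question does not say which column identifies the rows that belong together, so the grouping key `code` and the flat columns `name` and `phone` are assumptions chosen for illustration, as are the object and method names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object CollectListSketch {
  // Pack the values of interest into one struct column, then gather all structs
  // that share the same key into a single array column.
  def groupIntoLists(df: DataFrame): DataFrame =
    df.withColumn("entry", struct(col("name"), col("phone")))
      .groupBy(col("code"))
      .agg(collect_list(col("entry")).as("entries"))

  // Small demonstration: returns the number of collected entries per key.
  def demo(spark: SparkSession): Map[String, Int] = {
    import spark.implicits._
    val df = Seq(("A", "ann", "1"), ("A", "bob", "2"), ("B", "cy", "3"))
      .toDF("code", "name", "phone")

    groupIntoLists(df)
      .collect()
      .map(r => r.getString(0) -> r.getSeq[Any](1).size)
      .toMap
  }
}
```

Note that collect_list gives no ordering guarantee after a shuffle, so any merge that depends on order would need an explicit sort key inside the collected structs.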

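You don't have to think about encoders much (given that they are a pretty low-level and fairly advanced concept in Spark SQL).

Good morning, I tried to follow your recommendation as closely as possible, but I could not apply the standard functions to achieve what I need. So I had to go back to map, because what I am passing to the function in map is a Seq[Row], and it asks for an explicit encoder at runtime. But I am having trouble generating the encoder correctly. I posted a new question about this and am leaving the link here in case you have time to check it: Thank you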

Example of map with an implicitly resolved encoder, from a spark-shell session:

scala> :type ids
org.apache.spark.sql.Dataset[Long]

scala> ids.map(id => (id, "hello" * id.toInt)).show(truncate = false)
+---+---------------------------------------------+
|_1 |_2                                           |
+---+---------------------------------------------+
|0  |                                             |
|1  |hello                                        |
|2  |hellohello                                   |
|3  |hellohellohello                              |
|4  |hellohellohellohello                         |
|5  |hellohellohellohellohello                    |
|6  |hellohellohellohellohellohello               |
|7  |hellohellohellohellohellohellohello          |
|8  |hellohellohellohellohellohellohellohello     |
|9  |hellohellohellohellohellohellohellohellohello|
+---+---------------------------------------------+

Skeleton of the withColumn, groupBy and collect_list approach:

// leave you to fill the gaps
dfSoruce.withColumn(...).groupBy(...).agg(collect_list(...))