Create an ArrayType column in a DataFrame from existing DataFrame data in Scala


I have a DataFrame of owners:

accoutMasterId | OwnerMasterId | Owner name |
123            | ABC           | Jack       |
456            | DEF           | Amy        |
789            | ABC           | Rach       |

I want a new DataFrame that contains data like this:

accoutMasterIdArray | OwnerMasterId
{123,789}           | ABC
{456}               | DEF

The accoutMasterIdArray field should be of ArrayType.

Any suggestions?

Use .groupBy together with the collect_list function to create the array:

import org.apache.spark.sql.functions.{col, collect_list}

//sample dataframe
ownerMaster.show()
//+---------------+-------------+---------+
//|accountMasterId|OwnerMasterId|Ownername|
//+---------------+-------------+---------+
//|            123|          ABC|     Jack|
//|            456|          DEF|      Amy|
//|            789|          ABC|     Rach|
//+---------------+-------------+---------+

ownerMaster.groupBy("OwnerMasterId").
agg(collect_list(col("accountMasterId")).alias("accoutMasterIdArray")).
show()

//cast the array to string type so the result can be written as a csv file
ownerMaster.groupBy("OwnerMasterId").
agg(collect_list(col("accountMasterId")).cast("string").alias("accoutMasterIdArray")).
show()
//+-------------+-------------------+
//|OwnerMasterId|accoutMasterIdArray|
//+-------------+-------------------+
//|          DEF|              [456]|
//|          ABC|         [123, 789]|
//+-------------+-------------------+

//schema
ownerMaster.groupBy("OwnerMasterId").agg(collect_list(col("accountMasterId")).alias("accoutMasterIdArray")).printSchema
//root
// |-- OwnerMasterId: string (nullable = true)
// |-- accoutMasterIdArray: array (nullable = true)
// |    |-- element: integer (containsNull = true)
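
For readers who want to run the answer end to end, here is a minimal self-contained sketch. The local SparkSession, the object name CollectListExample, the helper groupAccounts, and building ownerMaster from an in-memory Seq are all assumptions for illustration; the original post does not show how the DataFrame was created. sort_array is an addition that only makes the element order deterministic, since collect_list gives no ordering guarantee after a shuffle.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, sort_array}

object CollectListExample {

  // Hypothetical helper: one row per owner, account ids gathered into an array
  def groupAccounts(ownerMaster: DataFrame): DataFrame =
    ownerMaster
      .groupBy("OwnerMasterId")
      .agg(
        // sort_array is optional; it only fixes the order of the elements
        sort_array(collect_list(col("accountMasterId"))).alias("accoutMasterIdArray")
      )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-list-example")
      .getOrCreate()
    import spark.implicits._

    // Assumption: the sample rows from the question, built in memory
    val ownerMaster = Seq(
      (123, "ABC", "Jack"),
      (456, "DEF", "Amy"),
      (789, "ABC", "Rach")
    ).toDF("accountMasterId", "OwnerMasterId", "Ownername")

    val grouped = groupAccounts(ownerMaster)
    grouped.printSchema() // accoutMasterIdArray is reported as array<integer>
    grouped.show()

    spark.stop()
  }
}
```

If duplicate account ids should be dropped, collect_set can be used in place of collect_list.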

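Please share your code first. Did you try groupBy?

I am converting the DataFrame further into a DynamicFrame and then using it in my Glue job: glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://apexfiledrop/entities"}"""), transformationContext = "datasink", format = "csv").writeDynamicFrame(joinDF). It fails with this error: com.amazonaws.services.glue.util.SchemaException: cannot write the array field for associated accounts to CSV.

@ShrutiGusain, cast the array column to string, then write it as a csv file: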

ownerMaster.groupBy("OwnerMasterId").agg(collect_list(col("accountMasterId")).cast("string").alias("accoutMasterIdArray"))
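
As a possible variation on the cast suggested above, the array can also be flattened with concat_ws, which yields a plain delimited string such as 123,789 instead of the bracketed [123, 789]. This is only a sketch under assumptions: the object name ArrayToCsvExample, the helper flattenAccounts, the local session, and the temporary output directory are hypothetical, and it uses the plain Spark CSV writer rather than the Glue sink from the comment.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, sort_array}

object ArrayToCsvExample {

  // Hypothetical helper: join each owner's account ids into one comma-separated string
  def flattenAccounts(ownerMaster: DataFrame): DataFrame =
    ownerMaster
      .groupBy("OwnerMasterId")
      .agg(
        concat_ws(
          ",",
          // cast the elements to string explicitly before joining them
          sort_array(collect_list(col("accountMasterId"))).cast("array<string>")
        ).alias("accoutMasterIdArray")
      )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-to-csv-example")
      .getOrCreate()
    import spark.implicits._

    val ownerMaster = Seq(
      (123, "ABC", "Jack"),
      (456, "DEF", "Amy"),
      (789, "ABC", "Rach")
    ).toDF("accountMasterId", "OwnerMasterId", "Ownername")

    val flat = flattenAccounts(ownerMaster)
    flat.show()

    // Assumption: a throwaway local directory stands in for the real target path
    val outDir = Files.createTempDirectory("owner-csv").resolve("out").toString
    flat.write.option("header", "true").mode("overwrite").csv(outDir)

    spark.stop()
  }
}
```

Either way the column is an ordinary string by the time it reaches the writer, which is what the CSV format needs; the choice between cast("string") and concat_ws only affects how the values look in the file.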

Could you help me with the post mentioned below? It would be a great help. Thanks in advance.