Spark: group all elements of an array column into batches using Scala


scala dataframe apache-spark


I have a DataFrame with an array-type column whose number of elements varies from row to row, as shown by GPS_Array_Size in the input DataFrame below.

I need to post HTTP requests to an external service with 5,000 elements per request, embedding the requestid field into each tuple (as shown in the output DataFrame).

Input DataFrame schema:

root
 |-- requestid: string (nullable = true)
 |-- GPS_Array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- timestamp: double (nullable = true)
 |    |    |-- GPSLatitude: double (nullable = true)
 |    |    |-- GPSLongitude: double (nullable = true)
 |-- GPS_Array_Size: long (nullable = true)
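
For readers who want to reproduce the input, the schema above could be written out by hand roughly as follows. This is only a sketch: the object name InputSchema and the value names are made up for illustration, and the field names and nullability are copied from the printed schema.

```scala
import org.apache.spark.sql.types._

// Hypothetical holder object; only the field names and types come from the question.
object InputSchema {
  // One element of GPS_Array.
  val gpsPoint: StructType = StructType(Seq(
    StructField("timestamp", DoubleType, nullable = true),
    StructField("GPSLatitude", DoubleType, nullable = true),
    StructField("GPSLongitude", DoubleType, nullable = true)
  ))

  // Top-level row: request id, array of GPS points, and the array length.
  val schema: StructType = StructType(Seq(
    StructField("requestid", StringType, nullable = true),
    StructField("GPS_Array", ArrayType(gpsPoint, containsNull = true), nullable = true),
    StructField("GPS_Array_Size", LongType, nullable = true)
  ))
}
```

Printing InputSchema.schema.treeString should produce a tree of the same shape as the one shown above.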

Input DataFrame:


requestid  GPS_Array                                                                                                                                                                                  GPS_Array_Size
aaa        [{"timestamp":a1,"GPSLatitude":a1,"GPSLongitude":a1},{"timestamp":a2,"GPSLatitude":a2,"GPSLongitude":a2},{"timestamp":a6431,"GPSLatitude":a6431,"GPSLongitude":a6431}]      6431
bbb        [{"timestamp":b1,"GPSLatitude":b1,"GPSLongitude":b1},{"timestamp":b2,"GPSLatitude":b2,"GPSLongitude":b2},{"timestamp":b11876,"GPSLatitude":b11876,"GPSLongitude":b11876}]   11876
ccc        [{"timestamp":c1,"GPSLatitude":c1,"GPSLongitude":c1},{"timestamp":c2,"GPSLatitude":c2,"GPSLongitude":c2},{"timestamp":c763,"GPSLatitude":c763,"GPSLongitude":c763}]         763
ddd        [{"timestamp":d1,"GPSLatitude":d1,"GPSLongitude":d1},{"timestamp":d2,"GPSLatitude":d2,"GPSLongitude":d2},{"timestamp":d5187,"GPSLatitude":d5187,"GPSLongitude":d5187}]      5187
eee        [{"timestamp":e1,"GPSLatitude":e1,"GPSLongitude":e1},{"timestamp":e2,"GPSLatitude":e2,"GPSLongitude":e2},{"timestamp":e1023,"GPSLatitude":e1023,"GPSLongitude":e1023}]      1023
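
To make the requirement concrete, here is a minimal sketch of the intended transformation on plain Scala collections, without Spark. It tags every array element with its requestid, flattens all rows into one stream, and cuts that stream into batches of a fixed size. The case classes and the toBatches function are hypothetical names chosen for illustration; only the field names come from the schema.

```scala
// Hypothetical types mirroring the schema in the question.
final case class GpsPoint(timestamp: Double, GPSLatitude: Double, GPSLongitude: Double)
final case class Request(requestid: String, GPS_Array: Seq[GpsPoint])
final case class TaggedPoint(
  requestid: String,
  timestamp: Double,
  GPSLatitude: Double,
  GPSLongitude: Double
)

object BatchingSketch {
  // Flatten all rows into (requestid + point) tuples, then cut the stream
  // into consecutive batches of at most batchSize elements.
  def toBatches(requests: Seq[Request], batchSize: Int): Iterator[Seq[TaggedPoint]] =
    requests.iterator
      .flatMap { r =>
        r.GPS_Array.iterator.map { p =>
          TaggedPoint(r.requestid, p.timestamp, p.GPSLatitude, p.GPSLongitude)
        }
      }
      .grouped(batchSize)
}
```

With the five sample rows above (6431 + 11876 + 763 + 5187 + 1023 = 25,280 elements) and a batch size of 5,000, this gives five full batches and a final batch of 280. A batch can span several requestid values, which is why the id is embedded in each tuple.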

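Have you tried repartition and spark_partition_id, as discussed last time?

Yes. But to split 700 million records into 140,000 batches of 5,000 each, I would need to repartition the DataFrame into 140,000 partitions and then assign spark_partition_id:

explodedDF.repartition(140000).withColumn("id", spark_partition_id())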
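
One possible way around the 140,000-way repartition, sketched here under assumptions rather than taken from the thread, is to keep the existing partitioning and chunk each partition's iterator locally with grouped(5000). The postBatch function below is a hypothetical stand-in for the HTTP call, the case classes and the tiny input are invented for illustration, and a local master is assumed. Note the trade-off: the last batch of every partition may hold fewer than 5,000 elements, so this only fits if the service accepts batches of up to 5,000.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring the input schema.
final case class GpsRecord(timestamp: Double, GPSLatitude: Double, GPSLongitude: Double)
final case class RequestRow(requestid: String, GPS_Array: Seq[GpsRecord])

object PartitionLocalBatches {
  // Batch size taken from the question.
  val BatchSize = 5000

  // Hypothetical stand-in for the HTTP call to the external service.
  def postBatch(batch: Seq[Row]): Unit =
    println(s"would post ${batch.size} elements")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-local-batches")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-in for the real input DataFrame.
    val inputDF = Seq(
      RequestRow("aaa", Seq(GpsRecord(1.0, 10.0, 20.0), GpsRecord(2.0, 11.0, 21.0))),
      RequestRow("bbb", Seq(GpsRecord(3.0, 12.0, 22.0)))
    ).toDF()

    // One row per array element, with requestid carried along.
    val explodedDF = inputDF
      .select(col("requestid"), explode(col("GPS_Array")).as("point"))
      .select("requestid", "point.timestamp", "point.GPSLatitude", "point.GPSLongitude")

    // Chunk each partition's rows into groups of at most BatchSize and
    // hand each group to postBatch; no extra repartition is needed.
    explodedDF.rdd.foreachPartition { rows =>
      rows.grouped(BatchSize).foreach(postBatch)
    }

    spark.stop()
  }
}
```

If exactly 5,000 elements per batch are required across the whole dataset, a global index (for example via zipWithIndex on the RDD, divided by the batch size) would be needed instead, at the cost of an additional shuffle to bring each batch together.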