PySpark dataframe not grouping correctly


I have an application that has been running successfully for months. Recently it started failing: a method that receives a PySpark dataframe, groups it, and returns the result somehow outputs a corrupted dataframe.

Here is sample code for what I'm doing when everything fails:

from pyspark.sql.functions import sum, avg

group_by = pyspark_df_in.groupBy("dimension1", "dimension2", "dimension3")
pyspark_df_out = group_by.agg(sum("metric1").alias("MyMetric1"), sum("metric2").alias("MyMetric2"))
If I print `print(pyspark_df_in.head(1))`, I correctly get the first row of the dataset. After grouping by a few dimensions, printing `print(pyspark_df_out.head(2))` gives the error below. I get similar errors when I try to do anything at all with the grouped dataset (and I know the group-by should produce data, because I've inspected the input carefully to make sure).

19/10/08 15:13:00 WARN TaskSetManager: Stage 9 contains a task of very large size (282 KB). The maximum recommended task size is 100 KB.
19/10/08 15:13:00 ERROR Executor: Exception in task 1.0 in stage 9.0 (TID 40)
java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:862)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
19/10/08 15:13:00 ERROR Executor: Exception in task 2.0 in stage 9.0 (TID 41)
java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:862)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 1.0 in stage 9.0 (TID 40, localhost, executor driver): java.util.NoSuchElementException
    at java.util.ArrayList$Itr.next(ArrayList.java:862)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:167)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.<init>(ArrowConverters.scala:144)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.fromBatchIterator(ArrowConverters.scala:143)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:203)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$3.apply(ArrowConverters.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

19/10/08 15:13:00 ERROR TaskSetManager: Task 1 in stage 9.0 failed 1 times; aborting job
19/10/08 15:13:00 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 39, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 3.0 in stage 9.0 (TID 42, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 10.0 in stage 9.0 (TID 49, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID 46, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 6.0 in stage 9.0 (TID 45, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 11.0 in stage 9.0 (TID 50, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 9.0 in stage 9.0 (TID 48, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 8.0 in stage 9.0 (TID 47, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 4.0 in stage 9.0 (TID 43, localhost, executor driver): TaskKilled (Stage cancelled)
19/10/08 15:13:00 WARN TaskSetManager: Lost task 5.0 in stage 9.0 (TID 44, localhost, executor driver): TaskKilled (Stage cancelled)
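The stack trace runs through `org.apache.spark.sql.execution.arrow.ArrowConverters`, so Arrow-based conversion is on the failure path. As a hypothetical isolation step (not a fix), one could disable Arrow conversion and re-run, which tells an Arrow-layer problem apart from a problem in the aggregation itself. This config fragment assumes an existing `spark` session and uses the Spark 2.x name of the setting:

```python
# Isolation step only: turn off Arrow-based data conversion.
# In Spark 3.x the property is spark.sql.execution.arrow.pyspark.enabled.
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
```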