How to compute, for each row, the mean of the values before and after a given index in PySpark?


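I have a DataFrame with multiple value columns plus an index column. For each row, I need to compute the mean of the value columns before and after the position given by that row's index.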

Here is my pandas code:

for i in range(len(res.index)):
    i = int(i)
    m = int(res['index'].ix[i])
    n = len(res.columns[1:m])
    if n == 0:
        res['mean'].ix[i] = 0
    else:
        res['mean'].ix[i] = int(res.ix[i, 1:m].sum() / n)

I would like to do the same thing in PySpark. Please help.
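
For readers on current pandas: the `.ix` indexer was removed in pandas 1.0, so the loop above no longer runs as written. Below is a minimal sketch of what the same per-row calculation might look like with `iloc`. The frame, its column names (`id`, `a`, `b`, `c`) and the helper `mean_before_index` are hypothetical and only serve the illustration; the result is also left as a float rather than truncated with `int`.

```python
import pandas as pd

# Hypothetical frame: an id column, three value columns, and a per-row "index".
res = pd.DataFrame({
    "id": [0, 1, 2],
    "a": [10, 20, 30],
    "b": [12, 22, 32],
    "c": [14, 24, 34],
    "index": [3, 1, 2],
})


def mean_before_index(row):
    """Mean of the values at positions 1 .. m-1 of the row, or 0.0 if that slice is empty."""
    m = int(row["index"])
    window = row.iloc[1:m]  # same positional slice as res.ix[i, 1:m] in the question
    return float(window.mean()) if len(window) else 0.0


# apply(axis=1) passes each row as a Series, which replaces the explicit loop.
res["mean"] = res.apply(mean_before_index, axis=1)
print(res)
```

This only restates the question's logic; it is not the PySpark solution being asked for.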

You can compute this with a UDF in PySpark. Here is an example:

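from pyspark.sql import functions as F
from pyspark.sql import types as T


# In Python 3, range() must be wrapped in list() before concatenating.
# sqlContext is the entry point in older shells; with Spark 2+ the same
# method is available on the SparkSession (spark.createDataFrame).
sample_data = sqlContext.createDataFrame([
    list(range(10)) + [4],
    list(range(50, 60)) + [2],
    list(range(9, 19)) + [4],
    list(range(19, 29)) + [3],
], ["col_" + str(i) for i in range(10)] + ["index"])
sample_data.show()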
sample_data.show()


+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    0|    1|    2|    3|    4|    5|    6|    7|    8|    9|    4|
|   50|   51|   52|   53|   54|   55|   56|   57|   58|   59|    2|
|    9|   10|   11|   12|   13|   14|   15|   16|   17|   18|    4|
|   19|   20|   21|   22|   23|   24|   25|   26|   27|   28|    3|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+


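def def_mn(data, index, mean="pre"):
    # "pre" averages the values before the index, "post" the values from the index on.
    if mean == "pre":
        part = data[:index]
    elif mean == "post":
        part = data[index:]
    else:
        return None
    # Return 0.0 for an empty slice (e.g. index 0), as in the question's n == 0 branch.
    return sum(part) / float(len(part)) if part else 0.0

# Without an explicit return type, F.udf returns strings.
mn_udf = F.udf(def_mn, T.DoubleType())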

sample_data.withColumn(
    "index_pre_mean", 
    mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index")
).withColumn(
    "index_post_mean", 
    mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index", F.lit("post"))
).show()

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|index_pre_mean|index_post_mean|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|0    |1    |2    |3    |4    |5    |6    |7    |8    |9    |4    |1.5           |6.5            |
|50   |51   |52   |53   |54   |55   |56   |57   |58   |59   |2    |50.5          |55.5           |
|9    |10   |11   |12   |13   |14   |15   |16   |17   |18   |4    |10.5          |15.5           |
|19   |20   |21   |22   |23   |24   |25   |26   |27   |28   |3    |20.0          |25.0           |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
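
To make the slicing behind these numbers easy to check without a Spark session, here is a small plain-Python sketch that repeats the calculation on the four sample rows. The helper name `mean_around_index` is made up for this illustration, and returning 0.0 for an empty side is an assumption taken from the `n == 0` branch in the question.

```python
def mean_around_index(values, index):
    """Return (mean of values[:index], mean of values[index:]); 0.0 for an empty side."""
    def safe_mean(part):
        return sum(part) / float(len(part)) if part else 0.0

    return safe_mean(values[:index]), safe_mean(values[index:])


# Rows copied from sample_data: (col_0 .. col_9, index)
rows = [
    (list(range(10)), 4),
    (list(range(50, 60)), 2),
    (list(range(9, 19)), 4),
    (list(range(19, 29)), 3),
]

results = [mean_around_index(values, index) for values, index in rows]
for (values, index), (pre, post) in zip(rows, results):
    print(index, pre, post)
```

For the first row, the "pre" slice is `[0, 1, 2, 3]` and the "post" slice is `[4, 5, 6, 7, 8, 9]`, which gives 1.5 and 6.5 as in the table. Note that the value at the index position itself falls into the "post" side.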

It is not clear to me what you are asking. Could you show a sample DataFrame with example input and the desired output?