Apache Spark: Calculating the grouped median in PySpark

Tags: apache-spark, pyspark, apache-spark-sql

When using pyspark, I'd like to be able to calculate the difference between grouped values and their group's median. Is this possible? Here is some code I hacked up that does what I want, except that it calculates the grouped diff from the mean. Also, please feel free to comment on how I could make this better if you feel like being helpful :)

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StringType,
    LongType,
    DoubleType,
    StructField,
    StructType
)
from pyspark.sql import functions as F

sc = SparkContext(appName='myapp')
spark = SparkSession(sc)
file_name = 'data.csv'
fields = [
    StructField(
        'group2',
        LongType(),
        True),
    StructField(
        'name',
        StringType(),
        True),
    StructField(
        'value',
        DoubleType(),
        True),
    StructField(
        'group1',
        LongType(),
        True)
]
schema = StructType(fields)
df = spark.read.csv(
    file_name, header=False, mode="DROPMALFORMED", schema=schema
)
df.show()
means = df.select([
    'group1',
    'group2',
    'name',
    'value']).groupBy([
    'group1',
    'group2'
]).agg(
    F.mean('value').alias('mean_value')
).orderBy('group1', 'group2')
cond = [df.group1 == means.group1, df.group2 == means.group2]
means.show()
df = df.select([
    'group1',
    'group2',
    'name',
    'value']).join(
    means,
    cond
).drop(
    df.group1
).drop(
    df.group2
).select('group1',
         'group2',
         'name',
         'value',
         'mean_value')
final = df.withColumn(
    'diff',
    F.abs(df.value - df.mean_value))
final.show()
sc.stop()
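As a side note, the aggregate-then-join round trip above can be collapsed into a single pass with a window function; a minimal sketch of that variant, assuming the same df, imports, and column names as the script above:

# Sketch only: compute the per-group mean as a window aggregate instead
# of aggregating into a second DataFrame and joining it back.
from pyspark.sql import Window

w = Window.partitionBy('group1', 'group2')
final = df.withColumn('mean_value', F.mean('value').over(w)) \
          .withColumn('diff', F.abs(F.col('value') - F.col('mean_value')))
final.show()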
Here is the sample data I'm working with:

100,name1,0.43,0
100,name2,0.33,0
100,name3,0.73,0
101,name1,0.29,0
101,name2,0.96,0
101,name3,0.42,0
102,name1,0.01,0
102,name2,0.42,0
102,name3,0.51,0
103,name1,0.55,0
103,name2,0.45,0
103,name3,0.02,0
104,name1,0.93,0
104,name2,0.16,0
104,name3,0.74,0
105,name1,0.41,0
105,name2,0.65,0
105,name3,0.29,0
100,name1,0.51,1
100,name2,0.51,1
100,name3,0.43,1
101,name1,0.59,1
101,name2,0.55,1
101,name3,0.84,1
102,name1,0.01,1
102,name2,0.98,1
102,name3,0.44,1
103,name1,0.47,1
103,name2,0.16,1
103,name3,0.02,1
104,name1,0.83,1
104,name2,0.89,1
104,name3,0.31,1
105,name1,0.59,1
105,name2,0.77,1
105,name3,0.45,1
And this is what I'm trying to produce:

group1,group2,name,value,median,diff
0,100,name1,0.43,0.43,0.0
0,100,name2,0.33,0.43,0.10
0,100,name3,0.73,0.43,0.30
0,101,name1,0.29,0.42,0.13
0,101,name2,0.96,0.42,0.54
0,101,name3,0.42,0.42,0.0
0,102,name1,0.01,0.42,0.41
0,102,name2,0.42,0.42,0.0
0,102,name3,0.51,0.42,0.09
0,103,name1,0.55,0.45,0.10
0,103,name2,0.45,0.45,0.0
0,103,name3,0.02,0.45,0.43
0,104,name1,0.93,0.74,0.19
0,104,name2,0.16,0.74,0.58
0,104,name3,0.74,0.74,0.0
0,105,name1,0.41,0.41,0.0
0,105,name2,0.65,0.41,0.24
0,105,name3,0.29,0.41,0.24
1,100,name1,0.51,0.51,0.0
1,100,name2,0.51,0.51,0.0
1,100,name3,0.43,0.51,0.08
1,101,name1,0.59,0.59,0.0
1,101,name2,0.55,0.59,0.04
1,101,name3,0.84,0.59,0.25
1,102,name1,0.01,0.44,0.43
1,102,name2,0.98,0.44,0.54
1,102,name3,0.44,0.44,0.0
1,103,name1,0.47,0.16,0.31
1,103,name2,0.16,0.16,0.0
1,103,name3,0.02,0.16,0.14
1,104,name1,0.83,0.83,0.0
1,104,name2,0.89,0.83,0.06
1,104,name3,0.31,0.83,0.52
1,105,name1,0.59,0.59,0.0
1,105,name2,0.77,0.59,0.18
1,105,name3,0.45,0.59,0.14

You can solve it with a udf function for the median. First, let's create the simple example from the data given above:

# sample data
ls = [[100, 'name1', 0.43, 0],
      [100, 'name2', 0.33, 0],
      [100, 'name3', 0.73, 0],
      [101, 'name1', 0.29, 0],
      [101, 'name2', 0.96, 0],
      [...]]
df = spark.createDataFrame(ls, schema=['a', 'b', 'c', 'd'])
Here is the udf function that computes the median:

# udf for the median
import numpy as np
import pyspark.sql.functions as func
from pyspark.sql.types import FloatType  # needed for the udf return type

def median(values_list):
    med = np.median(values_list)
    return float(med)

udf_median = func.udf(median, FloatType())
group_df = df.groupby(['a', 'd'])
df_grouped = group_df.agg(udf_median(func.collect_list(func.col('c'))).alias('median'))
df_grouped.show()
Finally, you can join it back with the original df to bring the median column in:

df_grouped = df_grouped.withColumnRenamed('a', 'a_').withColumnRenamed('d', 'd_')
df_final = df.join(df_grouped, [df.a == df_grouped.a_, df.d == df_grouped.d_]).select('a', 'b', 'c', 'median')
df_final = df_final.withColumn('diff', func.round(func.col('c') - func.col('median'), 2))

Note that I used round at the end to prevent extra digits appearing after the median operation.
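The answer's last step keeps the sign of the difference; a small variation (not part of the original answer) wraps the subtraction in func.abs to match the absolute difference shown in the question's expected output:

# Variation: absolute difference, as in the question's expected output.
df_final = df_final.withColumn(
    'diff', func.round(func.abs(func.col('c') - func.col('median')), 2))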

Comments:

- (OP) I'm trying to use window functions for this. However, I still haven't accomplished the task with the udf median function I created. My understanding is that to do this properly you'd need a udaf function, since it would be implemented inside .agg(...), but there are no udafs in Python.
- (answerer) Yes, that's correct @craigching. I just updated my attempt to implement the mean aggregation. However, it's not the correct solution that you asked for.
- (OP) Your udf median works for me. Is there anything I should watch out for in what you're doing there?
- (answerer) @craigching, yes, it works. It just doesn't give you the correct solution. To accomplish the task, you have to join it back on ('a', 'b', 'd').
- (OP) If you clean this up, removing the first part and keeping only the median part, I'm willing to mark this answer, because it's exactly what I asked for and it worked for me. Although I do think the implementation would be much cleaner if you had windowing.
- (answerer) Hi @craigching, I cleaned up the solution. Hope it works for you! If you want to calculate the median without using any udf, you can check out the solution here:
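For reference, the usual udf-free route goes through Spark SQL's built-in percentile_approx aggregate; a minimal sketch, assuming a Spark version that ships percentile_approx and using the question's original column names:

import pyspark.sql.functions as F

# Sketch only: grouped median via the percentile_approx SQL aggregate,
# no Python udf involved (0.5 = the 50th percentile, i.e. the median).
medians = df.groupBy('group1', 'group2').agg(
    F.expr('percentile_approx(value, 0.5)').alias('median'))

# Join the per-group median back and take the absolute difference.
result = (df.join(medians, on=['group1', 'group2'])
            .withColumn('diff', F.abs(F.col('value') - F.col('median'))))
result.show()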
把它连接起来。如果你想清理它,去掉第一部分,只保留中间部分,我愿意标记这个答案,因为它正是我所要求的,并且为我工作。尽管我认为如果你有窗口的话,它的实现会非常简洁。嗨@craigching,我清理了解决方案。希望这对你有用!如果您想在不使用任何自定义项的情况下计算中值,可以在此处查看解决方案: