Groupby operation on multiple columns in PySpark
I have applied a groupby and am computing the standard deviation of two features in a PySpark dataframe:
from pyspark.sql import functions as f
val1 = [('a',20,100),('a',100,100),('a',50,100),('b',0,100),('b',0,100),('c',0,0),('c',0,50),('c',0,100),('c',0,20)]
cols = ['group','val1','val2']
tf = spark.createDataFrame(val1, cols)
tf.show()
tf.groupby('group').agg(f.stddev(['val1','val2']).alias('val1_std','val2_std'))
But it gives me the following error:
TypeError: _() takes 1 positional argument but 2 were given
How do I do this in PySpark?

The problem is that the stddev function operates on a single column, not on a list of columns as in your code (hence the error message about 1 vs. 2 arguments). One way to get what you want is to compute the standard deviation of each column separately:
# Build one stddev expression per column, aliased as '<col>_std'
expressions = [f.stddev(col).alias('%s_std' % col) for col in ['val1', 'val2']]
# Run all aggregations in a single pass
tf.groupby('group').agg(*expressions).show()
#+-----+------------------+------------------+
#|group| val1_std| val2_std|
#+-----+------------------+------------------+
#| c| 0.0|43.493294502332965|
#| b| 0.0| 0.0|
#| a|40.414518843273804| 0.0|
#+-----+------------------+------------------+
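As a sanity check on the numbers above: Spark's `f.stddev` is an alias for `stddev_samp`, i.e. the sample standard deviation (with Bessel's correction, ddof=1). The values in the table can be reproduced on the raw lists with Python's standard-library `statistics.stdev`, which uses the same formula:

```python
import statistics

# val1 values for group 'a' and val2 values for group 'c' from the example data
a_val1 = [20, 100, 50]
c_val2 = [0, 50, 100, 20]

print(statistics.stdev(a_val1))  # matches val1_std for group 'a' (~40.4145)
print(statistics.stdev(c_val2))  # matches val2_std for group 'c' (~43.4933)
```

This is purely a verification aid; it does not replace the Spark aggregation, which runs distributed over the dataframe.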