Apache Spark: aggregating two columns with PySpark
Learning Apache Spark through PySpark and running into an issue. I have the following DataFrame:
+----------+------------+-----------+----------------+
| game_id|posteam_type|total_plays|total_touchdowns|
+----------+------------+-----------+----------------+
|2009092003| home| 90| 3|
|2010091912| home| 95| 0|
|2010112106| home| 75| 0|
|2010121213| home| 85| 3|
|2009092011| null| 9| null|
|2010110703| null| 2| null|
|2010112111| null| 6| null|
|2011100909| home| 102| 3|
|2011120800| home| 72| 2|
|2012010110| home| 74| 6|
|2012110410| home| 68| 1|
|2012120911| away| 91| 2|
|2011103008| null| 6| null|
|2012111100| null| 3| null|
|2013092212| home| 86| 6|
|2013112407| home| 73| 4|
|2013120106| home| 99| 3|
|2014090705| home| 94| 3|
|2014101203| home| 77| 4|
|2014102611| home| 107| 6|
+----------+------------+-----------+----------------+
I am trying to work out the average number of plays required to score a touchdown, i.e. sum(total_plays)/sum(total_touchdowns).
I worked out the code to get each sum, but can't figure out how to get the overall average:
plays = nfl_game_play.groupBy().agg({'total_plays': 'sum'}).collect()
touchdowns = nfl_game_play.groupBy().agg({'total_touchdowns': 'sum'}).collect()
As you can see, I tried storing each sum as a variable, but short of remembering what each value is and doing the division by hand, I can't figure out how to combine them.
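Finishing that by hand would look something like the sketch below: each collect() call returns a one-element list of Rows, and the dictionary-style agg names its output column sum(<col>), so the values have to be unpacked before dividing.

# Unpack the single Row from each collected result, then divide in plain Python
total_plays_sum = plays[0]['sum(total_plays)']
touchdowns_sum = touchdowns[0]['sum(total_touchdowns)']
avg_plays_per_td = total_plays_sum / touchdowns_sum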
Try with the code below. Example:
df.show()
#+-----------+----------------+
#|total_plays|total_touchdowns|
#+-----------+----------------+
#| 90| 3|
#| 95| 0|
#| 9| null|
#+-----------+----------------+
from pyspark.sql.functions import *
# groupBy() with no columns aggregates over the whole DataFrame,
# and sum() skips nulls, so the null touchdown row contributes nothing.
total_avg=df.groupBy().agg(sum("total_plays")/sum("total_touchdowns")).collect()[0][0]
#64.66666666666667
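Applied to the DataFrame from the question, the same pattern would look something like this sketch (nfl_game_play is the name assumed from the question's code; the qualified import avoids shadowing Python's built-in sum):

import pyspark.sql.functions as F

# df.agg(...) is shorthand for df.groupBy().agg(...); first() pulls the single Row
avg_plays_per_td = nfl_game_play.agg(
    F.sum("total_plays") / F.sum("total_touchdowns")
).first()[0]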
Do you want to group without any columns, or group on some column so that you get the average for that specific group? What output do you want or expect?
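If a per-group average is what's wanted, a minimal sketch under that assumption (grouping on posteam_type purely as an illustration; the DataFrame name is again taken from the question):

import pyspark.sql.functions as F

# One average-plays-per-touchdown value per posteam_type group
nfl_game_play.groupBy("posteam_type").agg(
    (F.sum("total_plays") / F.sum("total_touchdowns")).alias("avg_plays_per_td")
).show()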