Apache Spark: aggregating two columns with PySpark


Learning Apache Spark through PySpark and ran into a problem.

I have the following DataFrame:

+----------+------------+-----------+----------------+
|   game_id|posteam_type|total_plays|total_touchdowns|
+----------+------------+-----------+----------------+
|2009092003|        home|         90|               3|
|2010091912|        home|         95|               0|
|2010112106|        home|         75|               0|
|2010121213|        home|         85|               3|
|2009092011|        null|          9|            null|
|2010110703|        null|          2|            null|
|2010112111|        null|          6|            null|
|2011100909|        home|        102|               3|
|2011120800|        home|         72|               2|
|2012010110|        home|         74|               6|
|2012110410|        home|         68|               1|
|2012120911|        away|         91|               2|
|2011103008|        null|          6|            null|
|2012111100|        null|          3|            null|
|2013092212|        home|         86|               6|
|2013112407|        home|         73|               4|
|2013120106|        home|         99|               3|
|2014090705|        home|         94|               3|
|2014101203|        home|         77|               4|
|2014102611|        home|        107|               6|
+----------+------------+-----------+----------------+
I'm trying to work out the average number of plays needed to score a TD, i.e. sum(total_plays) / sum(total_touchdowns).

I figured out the code for the sums, but can't work out how to get the overall average:

# Sum each column separately; collect() returns a list containing a single Row
plays = nfl_game_play.groupBy().agg({'total_plays': 'sum'}).collect()
touchdowns = nfl_game_play.groupBy().agg({'total_touchdowns': 'sum'}).collect()
As you can see, I tried storing each sum as a variable, but beyond that I'd just be remembering what each value is and doing the division by hand.
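For reference, a minimal sketch of that "manual" route, assuming the DataFrame is nfl_game_play as shown above: each collect() returns a single Row whose value can be pulled out by position and then divided.

from pyspark.sql import functions as F

# Pull the single aggregated value out of each collected Row (index [0][0])
plays = nfl_game_play.groupBy().agg(F.sum("total_plays")).collect()[0][0]
touchdowns = nfl_game_play.groupBy().agg(F.sum("total_touchdowns")).collect()[0][0]
avg_plays_per_td = plays / touchdowns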

Try the code below:

Example:

df.show()
#+-----------+----------------+
#|total_plays|total_touchdowns|
#+-----------+----------------+
#|         90|               3|
#|         95|               0|
#|          9|            null|
#+-----------+----------------+

from pyspark.sql.functions import *
total_avg = df.groupBy().agg(sum("total_plays")/sum("total_touchdowns")).collect()[0][0]
#64.66666666666667
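A side note, not from the original answer: the wildcard import shadows Python's built-in sum. An equivalent sketch using the usual F alias:

from pyspark.sql import functions as F

total_avg = df.groupBy().agg(
    (F.sum("total_plays") / F.sum("total_touchdowns")).alias("avg_plays_per_td")
).collect()[0][0]
# 64.66666666666667 for the three rows shown above (sum ignores nulls)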

Do you want to group without any column, or group by some column so you get the average for that particular group? What output do you want or expect?
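If the goal is a per-group average instead (for example by posteam_type), a minimal sketch along the same lines, assuming the column names from the question's DataFrame:

from pyspark.sql import functions as F

# One row per posteam_type, each with its own plays-per-touchdown ratio
per_group = df.groupBy("posteam_type").agg(
    (F.sum("total_plays") / F.sum("total_touchdowns")).alias("avg_plays_per_td")
)
per_group.show()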