PySpark: getting the count in an aggregated table

I have a table like this:

+-------------+-----+
| PULocationID| fare|
+-------------+-----+
|            1|    5|
|            1|   15|
|            2|    2|
+-------------+-----+
I want a table like this:

+-------------+----------+------+
| PULocationID| avg_fare | count|
+-------------+----------+------+
|            1|        10|     2|
|            2|         2|     1|
+-------------+----------+------+
Here is what I am trying:

result_table = trips.groupBy("PULocationID") \
        .agg(
            {"total_amount": "avg"},
            {"PULocationID": "count"}
    )
If I take out the count line, it works fine for getting the avg column. But I also need to count how many rows have that particular PULocationID.

Note: aside from from pyspark.sql.functions import col


Thanks for the help.

I was very close; I just had to format it as one dictionary instead of two:

result_table = trips.groupBy("PULocationID") \
        .agg({"total_amount": "avg", "PULocationID": "count"})
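
One thing to watch with the single-dictionary form is that Spark auto-generates the output column names (avg(total_amount) and count(PULocationID)). A minimal sketch of renaming them afterwards, assuming you want the avg_fare / count headers from the desired output:

result_table = trips.groupBy("PULocationID") \
        .agg({"total_amount": "avg", "PULocationID": "count"}) \
        .withColumnRenamed("avg(total_amount)", "avg_fare") \
        .withColumnRenamed("count(PULocationID)", "count")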

This should be a working solution for you, using avg() and count():

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 5), (1, 15), (2, 2)], ["PULocationID", "fare"])
df.show()
# average fare and count per PULocationID
df_group = df.groupBy("PULocationID").agg(F.avg("fare").alias("avg_fare"), F.count("PULocationID").alias("count"))
df_group.show()

Input
+------------+----+
|PULocationID|fare|
+------------+----+
|           1|   5|
|           1|  15|
|           2|   2|
+------------+----+

Output
+------------+--------+-----+
|PULocationID|avg_fare|count|
+------------+--------+-----+
|           1|    10.0|    2|
|           2|     2.0|    1|
+------------+--------+-----+
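
One small note on the count: F.count("PULocationID") counts the non-null values of that column, which equals the number of rows per group here because the grouping key is never null in this data. If you want a row count that does not depend on a particular column, a sketch like this (counting a literal) should give the same result:

df_group = df.groupBy("PULocationID").agg(F.avg("fare").alias("avg_fare"), F.count(F.lit(1)).alias("count"))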