PySpark: get the count in an aggregated table
I have a table like this:
+-------------+-----+
| PULocationID| fare|
+-------------+-----+
|            1|    5|
|            1|   15|
|            2|    2|
+-------------+-----+
I want a table like this:
+-------------+----------+------+
| PULocationID| avg_fare | count|
+-------------+----------+------+
|            1|        10|     2|
|            2|         2|     1|
+-------------+----------+------+
Here is what I am trying:
result_table = trips.groupBy("PULocationID") \
    .agg(
        {"total_amount": "avg"},
        {"PULocationID": "count"}
    )
If I take out the count line, getting the avg column works fine, but I also need a count of how many rows have that particular PULocationID.
Note: from pyspark.sql.functions import col
Thanks for the help.

I was very close; I had just formatted it as two dictionaries instead of one:
result_table = trips.groupBy("PULocationID") \
    .agg(
        {"total_amount": "avg", "PULocationID": "count"}
    )
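One caveat with the dictionary form of agg is that Spark auto-generates the result column names (along the lines of avg(total_amount) and count(PULocationID)). If the tidier names from the desired output are wanted, they can be renamed afterwards; a minimal sketch, reusing the trips DataFrame from the question:

result_table = trips.groupBy("PULocationID") \
    .agg({"total_amount": "avg", "PULocationID": "count"}) \
    .withColumnRenamed("avg(total_amount)", "avg_fare") \
    .withColumnRenamed("count(PULocationID)", "count")  # rename the auto-generated columns
result_table.show()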
This should be a working solution for you, using avg() and count() from pyspark.sql.functions:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 5), (1, 15), (2, 2)], ["PULocationID", "fare"])
df.show()
df_group = df.groupBy("PULocationID").agg(F.avg("fare").alias("avg_fare"),
                                           F.count("PULocationID").alias("count"))
df_group.show()
Input
+------------+----+
|PULocationID|fare|
+------------+----+
|           1|   5|
|           1|  15|
|           2|   2|
+------------+----+
Output
+------------+--------+-----+
|PULocationID|avg_fare|count|
+------------+--------+-----+
|           1|    10.0|    2|
|           2|     2.0|    1|
+------------+--------+-----+
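One side note on the count: F.count("PULocationID") counts only the non-null values of that column in each group. If every row should be counted regardless of nulls, counting a literal works instead; a minimal sketch, reusing the df from the answer above:

df_group = df.groupBy("PULocationID").agg(F.avg("fare").alias("avg_fare"),
                                           F.count(F.lit(1)).alias("count"))  # counts all rows per group
df_group.show()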