How to count unique IDs after groupBy in PySpark


I am using the following code to aggregate students per year. The goal is to find the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

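The problem is that many IDs are repeated, so the resulting counts are wrong and far too large.

I want to count the students per year while counting each Student_ID only once per year.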

Use the countDistinct function:

from pyspark.sql.functions import countDistinct
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x,["year","id"])

gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+

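You can also do the following, where y is the ungrouped DataFrame:

y.groupBy("year", "id").count().groupBy("year").count()

This query returns the number of unique students per year.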

Neither countDistinct() nor multiple aggregations are supported in streaming.

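For completeness (I loaded my data from a Hive table): you can also use .alias() to rename the column. Note that countDistinct does not count null as a distinct value!

Which version is this based on?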