How to calculate the conditional probability of values in a pyspark dataframe?

Tags: pyspark, apache-spark-sql, probability, pyspark-dataframes

I want to calculate, in pyspark and without collecting, the conditional probability of the ratings ('A', 'B', 'C') in the rating column with respect to the values of the type column.

Input:

    company     model    rating   type
0   ford       mustang     A      coupe
1   chevy      camaro      B      coupe
2   ford       fiesta      C      sedan
3   ford       focus       A      sedan
4   ford       taurus      B      sedan
5   toyota     camry       B      sedan
Output:

    rating   type    conditional_probability
0     A      coupe   0.50   
1     B      coupe   0.33
2     C      sedan   1.00
3     A      sedan   0.50
4     B      sedan   0.66

You can use groupby to get the count of items for each individual rating, and for each combination of rating and type, and use those counts to compute the conditional probability.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the example input as a Spark DataFrame
ratings_cols = ["company", "model", "rating", "type"]
ratings_values = [
    ("ford", "mustang", "A", "coupe"),
    ("chevy", "camaro", "B", "coupe"),
    ("ford", "fiesta", "C", "sedan"),
    ("ford", "focus", "A", "sedan"),
    ("ford", "taurus", "B", "sedan"),
    ("toyota", "camry", "B", "sedan"),
]
ratings_df = spark.createDataFrame(data=ratings_values, schema=ratings_cols)
ratings_df.show()
ratings_df.show()
# +-------+-------+------+-----+                                                  
# |company|  model|rating| type|
# +-------+-------+------+-----+
# |   ford|mustang|     A|coupe|
# |  chevy| camaro|     B|coupe|
# |   ford| fiesta|     C|sedan|
# |   ford|  focus|     A|sedan|
# |   ford| taurus|     B|sedan|
# | toyota|  camry|     B|sedan|
# +-------+-------+------+-----+

# Count rows per (rating, type), join the total count per rating,
# then divide to get the conditional probability of each type given the rating.
probability_df = (ratings_df.groupby(["rating", "type"])
                            .agg(F.count(F.lit(1)).alias("rating_type_count"))
                            .join(ratings_df.groupby("rating").agg(F.count(F.lit(1)).alias("rating_count")), on="rating")
                            .withColumn("conditional_probability", F.round(F.col("rating_type_count")/F.col("rating_count"), 2))
                            .select(["rating", "type", "conditional_probability"])
                            .sort(["type", "rating"]))

probability_df.show()
# +------+-----+-----------------------+                                          
# |rating| type|conditional_probability|
# +------+-----+-----------------------+
# |     A|coupe|                    0.5|
# |     B|coupe|                   0.33|
# |     A|sedan|                    0.5|
# |     B|sedan|                   0.67|
# |     C|sedan|                    1.0|
# +------+-----+-----------------------+
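A quick sanity check is that the conditional probabilities within each rating should sum to (approximately) 1:

# Probabilities over types should sum to ~1 within each rating (rounding aside)
probability_df.groupby("rating").agg(F.sum("conditional_probability").alias("prob_sum")).show()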

Please accept the answer if it solved your problem. :) @Safwan Thanks for your answer, it looks correct, but I ended up solving this using the window concept from the pyspark sql functions, which is more efficient.
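For reference, here is a minimal sketch of what the window-based variant mentioned in the comment might look like (the commenter's actual code is not shown); it reuses ratings_df from the answer and replaces the second groupby and join with a window partitioned by rating:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count each (rating, type) combination, then use a window over rating to get
# the per-rating total without a separate aggregation and join.
rating_window = Window.partitionBy("rating")

window_probability_df = (ratings_df.groupby(["rating", "type"])
                                   .agg(F.count(F.lit(1)).alias("rating_type_count"))
                                   .withColumn("rating_count", F.sum("rating_type_count").over(rating_window))
                                   .withColumn("conditional_probability", F.round(F.col("rating_type_count")/F.col("rating_count"), 2))
                                   .select(["rating", "type", "conditional_probability"])
                                   .sort(["type", "rating"]))

window_probability_df.show()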