How to calculate the conditional probability of values in a PySpark DataFrame?
I want to calculate the conditional probability of each rating ('A', 'B', 'C') in the rating column with respect to the values of the type column in PySpark, without collecting; that is, for each rating, the fraction of its rows that fall into each type. Input:
  company    model rating   type
0    ford  mustang      A  coupe
1   chevy   camaro      B  coupe
2    ford   fiesta      C  sedan
3    ford    focus      A  sedan
4    ford   taurus      B  sedan
5  toyota    camry      B  sedan
Output:
  rating   type  conditional_probability
0      A  coupe                     0.50
1      B  coupe                     0.33
2      C  sedan                     1.00
3      A  sedan                     0.50
4      B  sedan                     0.66
You can use groupby to get the row count for each rating on its own and for each (rating, type) combination, then divide the two to get the conditional probability: conditional_probability = count(rating, type) / count(rating), i.e. P(type | rating).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the example data
ratings_cols = ["company", "model", "rating", "type"]
ratings_values = [
    ("ford", "mustang", "A", "coupe"),
    ("chevy", "camaro", "B", "coupe"),
    ("ford", "fiesta", "C", "sedan"),
    ("ford", "focus", "A", "sedan"),
    ("ford", "taurus", "B", "sedan"),
    ("toyota", "camry", "B", "sedan"),
]
ratings_df = spark.createDataFrame(data=ratings_values, schema=ratings_cols)
ratings_df.show()
# +-------+-------+------+-----+
# |company| model|rating| type|
# +-------+-------+------+-----+
# | ford|mustang| A|coupe|
# | chevy| camaro| B|coupe|
# | ford| fiesta| C|sedan|
# | ford| focus| A|sedan|
# | ford| taurus| B|sedan|
# | toyota| camry| B|sedan|
# +-------+-------+------+-----+
probability_df = (
    ratings_df.groupby(["rating", "type"])
    # rows per (rating, type) pair
    .agg(F.count(F.lit(1)).alias("rating_type_count"))
    # attach the total row count per rating
    .join(ratings_df.groupby("rating").agg(F.count(F.lit(1)).alias("rating_count")), on="rating")
    # P(type | rating) = pair count / rating total
    .withColumn("conditional_probability", F.round(F.col("rating_type_count") / F.col("rating_count"), 2))
    .select(["rating", "type", "conditional_probability"])
    .sort(["type", "rating"])
)
probability_df.show()
# +------+-----+-----------------------+
# |rating| type|conditional_probability|
# +------+-----+-----------------------+
# | A|coupe| 0.5|
# | B|coupe| 0.33|
# | A|sedan| 0.5|
# | B|sedan| 0.67|
# | C|sedan| 1.0|
# +------+-----+-----------------------+
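Note that F.round turns 2/3 into 0.67 here, while the expected output in the question shows 0.66. If that value was meant as truncation rather than rounding, the rounding expression could be swapped for a floor-based one; this is only a guess at the intent:

from pyspark.sql import functions as F

# Replaces F.round(...) in the .withColumn call above: floor at two
# decimal places, so 2/3 becomes 0.66 instead of 0.67.
truncated_probability = F.floor(F.col("rating_type_count") / F.col("rating_count") * 100) / 100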
Please accept the answer if it solved your problem. :)
@Safwan Thanks for your answer, it seems correct, but I solved this using the window-function concept from the pyspark.sql functions, which is more efficient.
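The window-based code mentioned in the comment isn't shown in the thread, so the following is only a sketch of what that approach might look like (the window and variable names are illustrative): both counts are computed as window aggregates over ratings_df, so no join is needed.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count rows per (rating, type) pair and per rating with window
# functions instead of a groupby + join.
pair_window = Window.partitionBy("rating", "type")
rating_window = Window.partitionBy("rating")

window_probability_df = (
    ratings_df
    .withColumn("rating_type_count", F.count(F.lit(1)).over(pair_window))
    .withColumn("rating_count", F.count(F.lit(1)).over(rating_window))
    .withColumn("conditional_probability",
                F.round(F.col("rating_type_count") / F.col("rating_count"), 2))
    .select("rating", "type", "conditional_probability")
    .distinct()  # one row per (rating, type) pair
    .sort("type", "rating")
)
window_probability_df.show()
# Should print the same table as probability_df above.

Skipping the join avoids one extra shuffle of ratings_df, which is presumably what makes this version faster, although the final distinct() still requires a shuffle.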