Conditional count in PySpark (Apache Spark)
I have the following code:
output = (assignations
    .join(activations, ['customer_id', 'external_id'], 'left')
    .join(redeemers, ['customer_id', 'external_id'], 'left')
    .groupby('external_id')
    .agg(f.expr('COUNT(DISTINCT(CASE WHEN assignation = 1 THEN customer_id ELSE NULL END))').alias('assigned'),
         f.expr('COUNT(DISTINCT(CASE WHEN activation = 1 THEN customer_id ELSE NULL END))').alias('activated'),
         f.expr('COUNT(DISTINCT(CASE WHEN redeemer = 1 THEN customer_id ELSE NULL END))').alias('redeemed'))
)
This code produces the following output:
external_id assigned activated redeemed
DISC0000089309 31968 901 491
DISC0000089428 31719 893 514
DISC0000089283 2617 60 39
My idea was to rewrite the CASE WHEN parts in a more Pythonic/PySpark style, so I tried the following code:
output = (assignations
    .join(activations, ['customer_id', 'external_id'], 'left')
    .join(redeemers, ['customer_id', 'external_id'], 'left')
    .groupby('external_id')
    .agg(f.count(f.when(f.col('assignation') == 1, True).alias('assigned')),
         f.count(f.when(f.col('activation') == 1, True).alias('activated')),
         f.count(f.when(f.col('redeemer') == 1, True).alias('redeem'))
    ))
The problem is that the outputs are not the same; the numbers don't match. How can I convert the code so that both versions give the same output?

You can use f.countDistinct to get the equivalent of Spark SQL's COUNT(DISTINCT). Your attempt differs for two reasons: f.count(f.when(cond, True)) counts every matching row rather than distinct customer_id values, and the .alias() calls are nested inside when(), so they never name the aggregated columns:
output = (assignations
    .join(activations, ['customer_id', 'external_id'], 'left')
    .join(redeemers, ['customer_id', 'external_id'], 'left')
    .groupby('external_id')
    .agg(
        f.countDistinct(f.when(f.col('assignation') == 1, f.col('customer_id'))).alias('assigned'),
        f.countDistinct(f.when(f.col('activation') == 1, f.col('customer_id'))).alias('activated'),
        f.countDistinct(f.when(f.col('redeemer') == 1, f.col('customer_id'))).alias('redeemed')
    )
)
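To see why the two versions disagree, here is a minimal, self-contained sketch; the toy data and values are made up for illustration. when() without otherwise() yields NULL for non-matching rows, and count() skips NULLs, so f.count(f.when(...)) tallies matching rows, while f.countDistinct(f.when(...)) tallies distinct matching customers:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: customer C1 has two rows with assignation = 1,
# so a plain count of matches returns 2 while a distinct count returns 1.
df = spark.createDataFrame(
    [('C1', 'E1', 1), ('C1', 'E1', 1), ('C2', 'E1', 0)],
    ['customer_id', 'external_id', 'assignation'])

df.groupby('external_id').agg(
    # counts matching *rows*: 2
    f.count(f.when(f.col('assignation') == 1, True)).alias('matching_rows'),
    # counts distinct matching *customers*: 1
    f.countDistinct(f.when(f.col('assignation') == 1, f.col('customer_id')))
        .alias('matching_customers'),
).show()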
Use countDistinct instead of count or groupBy('colname').count().orderBy().
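A short sketch of that distinction, reusing the toy df from above (names are illustrative): groupBy(...).count() tallies rows per group, while countDistinct collapses duplicate customer_id values within each group.

# Rows per group: E1 -> 3
df.groupBy('external_id').count().orderBy('count', ascending=False).show()

# Distinct customers per group: E1 -> 2 (C1's duplicate row is collapsed)
df.groupBy('external_id').agg(
    f.countDistinct('customer_id').alias('distinct_customers')).show()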