Converting a GROUP BY SQL join to PySpark
I am trying to join two tables in PySpark based on this SQL query:
%sql
SELECT c.cust_id, avg(b.gender_score) AS pub_masc
FROM df c
LEFT JOIN pub_df b
ON c.pp = b.pp
GROUP BY c.cust_id
I tried the following in PySpark, but I'm not sure it's correct, since it kept displaying data, so I went with max():
df.select('cust_id', 'pp') \
.join(pub_df, on = ['pp'], how = 'left')\
.avg(gender_score) as pub_masc
.groupBy('cust_id').max()
Any help would be appreciated. Thanks in advance.

Your Python code contains an invalid line:

.avg(gender_score) as pub_masc

Also, you should group first and then average, not the other way around:
import pyspark.sql.functions as F

df.select('cust_id', 'pp') \
  .join(pub_df, on='pp', how='left') \
  .groupBy('cust_id') \
  .agg(F.avg('gender_score').alias('pub_masc'))