Converting a SQL join with GROUP BY to PySpark


I am trying to join two tables based on this SQL query, using PySpark:

%sql
SELECT c.cust_id, avg(b.gender_score) AS pub_masc
FROM df c
LEFT JOIN pub_df b
  ON c.pp = b.pp
GROUP BY c.cust_id
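
For context, this query also runs as-is from PySpark through spark.sql, once both DataFrames are registered as temporary views (a minimal sketch, assuming the view names match the DataFrame names and spark is an active SparkSession):

# Minimal sketch: run the same SQL directly from PySpark.
# Assumes `df` and `pub_df` are existing DataFrames.
df.createOrReplaceTempView('df')
pub_df.createOrReplaceTempView('pub_df')

result = spark.sql("""
    SELECT c.cust_id, avg(b.gender_score) AS pub_masc
    FROM df c
    LEFT JOIN pub_df b ON c.pp = b.pp
    GROUP BY c.cust_id
""")
result.show()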
I tried the following in PySpark, but I'm not sure it's correct, since it kept just displaying the data, so I went with max():

df.select('cust_id', 'pp') \
                .join(pub_df, on = ['pp'], how = 'left')\
                .avg(gender_score) as pub_masc
                .groupBy('cust_id').max()
Any help would be appreciated.
Thanks in advance.

Your Python code contains an invalid line:
.avg(gender_score) as pub_masc
That is not valid Python syntax. You should also group first and then average, not the other way around:

import pyspark.sql.functions as F

df.select('cust_id', 'pp') \
  .join(pub_df, on='pp', how='left') \
  .groupBy('cust_id') \
  .agg(F.avg('gender_score').alias('pub_masc'))
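
A quick way to sanity-check the shape of the result is to run the chain on a couple of toy rows (hypothetical data, purely for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: customers linked to pages ('pp'), and per-page scores.
df = spark.createDataFrame([('c1', 'p1'), ('c1', 'p2'), ('c2', 'p1')],
                           ['cust_id', 'pp'])
pub_df = spark.createDataFrame([('p1', 0.8), ('p2', 0.4)],
                               ['pp', 'gender_score'])

# One row per cust_id, with the average gender_score as pub_masc.
(df.select('cust_id', 'pp')
   .join(pub_df, on='pp', how='left')
   .groupBy('cust_id')
   .agg(F.avg('gender_score').alias('pub_masc'))
   .show())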