PySpark DataFrame operation efficiency


Suppose I have the following data frame:

+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
|       123|  989| 0.9|      0|
|       123|  990| 0.8|      1|
|       123|  999| 0.7|      0|
|       234|  789| 0.9|      0|
|       234|  777| 0.7|      0|
|       234|  769| 0.6|      1|
|       234|  798| 0.5|      0|
+----------+-----+----+-------+
I then perform the following operations to obtain the final dataset (code shown below):


Is there a more efficient way to do this? Performing this set of transformations the way I currently do seems to be the bottleneck in my code. I would appreciate any feedback.

I haven't done any timing comparisons, but by not using any UDFs, Spark should be able to optimize things on its own.

#scala:  val dfad = sc.parallelize(Seq((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF("display_id","ad_id","prob","clicked")
#^^^that's^^^ the only difference (besides putting val in front of the variables) between this Python response and a Scala one

dfad = sc.parallelize(((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF(["display_id","ad_id","prob","clicked"])
dfad.registerTempTable("df_ad")



df1 = sqlContext.sql("SELECT display_id,collect_list(ad_id) ad_id_sorted FROM (SELECT * FROM df_ad SORT BY display_id,prob DESC) x GROUP BY display_id")
+----------+--------------------+
|display_id|        ad_id_sorted|
+----------+--------------------+
|       234|[789, 777, 769, 798]|
|       123|     [989, 990, 999]|
+----------+--------------------+
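If you prefer the DataFrame API over the SQL string, a roughly equivalent version of df1 might look like the sketch below (the name df1_api is mine, and I haven't timed it). Since collect_list on its own gives no ordering guarantee, this variant sorts the collected (prob, ad_id) structs explicitly:

from pyspark.sql import functions as F

# Collect (prob, ad_id) structs per display_id, sort them descending (prob is the
# leading struct field, so the sort is effectively by prob), then keep only the
# ad_id field of each struct.
df1_api = (dfad
    .groupBy("display_id")
    .agg(F.sort_array(F.collect_list(F.struct("prob", "ad_id")), asc=False).alias("tmp"))
    .select("display_id", F.col("tmp.ad_id").alias("ad_id_sorted")))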

df2 = sqlContext.sql("SELECT display_id, max(ad_id) as ad_id_set from df_ad where clicked=1 group by display_id")
+----------+---------+
|display_id|ad_id_set|
+----------+---------+
|       234|      769|
|       123|      990|
+----------+---------+
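The clicked-ad aggregation can be written with the DataFrame API in the same way (df2_api is my own name; an untimed sketch):

from pyspark.sql import functions as F

# Keep only clicked impressions, then take the max ad_id per display_id,
# mirroring the df2 SQL above.
df2_api = (dfad
    .filter(F.col("clicked") == 1)
    .groupBy("display_id")
    .agg(F.max("ad_id").alias("ad_id_set")))

Joining df1_api and df2_api on display_id should then reproduce final_df below.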


final_df = df1.join(df2,"display_id")
+----------+--------------------+---------+
|display_id|        ad_id_sorted|ad_id_set|
+----------+--------------------+---------+
|       234|[789, 777, 769, 798]|      769|
|       123|     [989, 990, 999]|      990|
+----------+--------------------+---------+
I didn't put ad_id_set into an array because you are computing max, and max should only return a single value. I believe that if you really need it in an array, you could do that.
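For example, swapping max for collect_set in the df2 query should give an array instead of a single value (an untested sketch; df2_arr is my own name):

df2_arr = sqlContext.sql("SELECT display_id, collect_set(ad_id) AS ad_id_set FROM df_ad WHERE clicked = 1 GROUP BY display_id")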


I've included the Scala nuance in case anyone using Scala runs into a similar problem in the future.

Thanks for providing this solution. I timed both solutions: yours executes in 1.38 ms, the original in 2.01 ms :)