Apache Spark: alternatives to union in PySpark


apache-spark pyspark


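I wrote a code snippet that does the following:

1. Take n rows for each stratum from a DataFrame (df1)
2. Sort the rows by stratum
3. Replace the data in one of the columns with data from another DataFrame (df2)
4. Union the DataFrames (df1 and df2)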

My understanding is that unionAll is an expensive operation in Spark. Is there another, more efficient and faster way to achieve the same result? Thanks.

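from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number the distinct seed rows; a window without partitionBy moves all rows to one partition
SeedWindow = Window.orderBy("SeedEmail")
# Rank rows within each col1 partition, ordered by col2
AlphaOutputWindow = Window.partitionBy("col1").orderBy("col2")

seedEmails = (seeds.filter(pos_filter_cond).select("col1....col2")
        .distinct().withColumn("row_id", row_number().over(SeedWindow)))

seedCounts = seedEmails.count()

# Keep the first seedCounts rows of each partition
sampleForSeed = (final_result.withColumn("row_id", row_number().over(AlphaOutputWindow))
        .filter(col("row_id") <= seedCounts)
    )

# Attach the seed data to the sampled rows
sampleAfterSeed = sampleForSeed.join(seedEmails, ["cols"], "inner")

finalOutputColumns = final_result_moduleCount.columns

# Append the seeded sample to the existing output
final_result_moduleCount = final_result_moduleCount.select(finalOutputColumns).unionAll(sampleAfterSeed.select(finalOutputColumns))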

So if I sample n rows for each stratum first and then sort them, instead of sorting the whole population, would that reduce the run time, since the window function would then run on a DataFrame of at most 100 rows? Thanks.