
Scala: an efficient way to join and group in Spark while minimizing shuffles


I have two large DataFrames, each with roughly a few million records:

val df1 = Seq(
 ("k1a","k2a", "g1x","g2x")
,("k1b","k2b", "g1x","g2x")
,("k1c","k2c", "g1x","g2y")
,("k1d","k2d", "g1y","g2y")
,("k1e","k2e", "g1y","g2y")
,("k1f","k2f", "g1z","g2y")
).toDF("key1", "key2", "grp1","grp2")

val df2 = Seq(
 ("k1a","k2a", "v4a")
,("k1b","k2b", "v4b")
,("k1c","k2c", "v4c")
,("k1d","k2d", "v4d")
,("k1e","k2e", "v4e")
,("k1f","k2f", "v4f")
).toDF("key1", "key2", "fld4")
I am trying to join them and run the groupBy below, but it takes forever. There are roughly a million unique grp1+grp2 combinations in df1:

val df3 = df1.join(df2,Seq("key1","key2"))
val df4 = df3.groupBy("grp1","grp2").agg(collect_list(struct($"key1",$"key2")).as("dups")).filter("size(dups)>1")
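One way to shrink the shuffle for the duplicate check (a sketch, not from the original post): collect_list gets no benefit from map-side partial aggregation, so every (key1, key2) struct is shuffled. Counting first moves far less data over the network, and the member key pairs can be recovered with a second join against the small set of duplicated groups:

```scala
import org.apache.spark.sql.functions.col

// Shuffle only (grp1, grp2) pairs plus partially aggregated counts,
// instead of every (key1, key2) struct.
val dupGrps = df3.groupBy("grp1", "grp2").count().filter(col("count") > 1)

// Recover the (key1, key2) members of each duplicated group.
val df4Alt = df3.join(dupGrps.select("grp1", "grp2"), Seq("grp1", "grp2"))
```

On the sample data this keeps the four rows belonging to the two groups that contain more than one key pair.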

Is there a way to reduce the shuffling? Is a mapPartitions approach appropriate for either step? Can someone suggest an efficient way to do this, ideally with an example?
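If df2 (or a projection of just the columns needed) fits comfortably in executor memory, the join shuffle can be eliminated with a broadcast hint. This is a sketch under that assumption; whether it applies here depends on the actual size of df2, which the post says has millions of records:

```scala
import org.apache.spark.sql.functions.broadcast

// Ship df2 to every executor; df1 is then joined map-side with no shuffle
// for the join itself. The groupBy still shuffles once on (grp1, grp2).
val df3Bc = df1.join(broadcast(df2), Seq("key1", "key2"))
```

Spark also broadcasts automatically when a side is below spark.sql.autoBroadcastJoinThreshold, so raising that setting is an alternative to the explicit hint.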

Comments:

- Hi, df3 isn't used in the computation of df4. Is that a mistake? Also, can you tell us what you do with df4 after creating it? Because Spark is lazy, the code as written doesn't trigger anything by itself; seeing how you act on df4 before saving it (collect or save) might help us explain why it takes so long.
- (Asker) Hi, I fixed the df4 computation. This is a simplified scenario. After the grouping I a) store the df4 result, b) compute a new df3 by removing from the current df3 the key1/key2 combinations that were grouped into df4, and c) compute a new df4 by running another groupBy on the new df3 (by grpy/grpx fields not shown in this example). Steps a-c are repeated for two more iterations with different groupBy criteria.
- You say that in df1 the tuple (key1, key2) is unique. Are there duplicate (key1, key2) tuples in df2?
- (Asker) The tuple (key1, key2) is unique in both df1 and df2. And mapPartitions doesn't apply to DataFrames.