Scala Spark: a faster way to join two DataFrames?


scala apache-spark


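I have two DataFrames, df1 and ip2Country. df1 contains IP addresses, and I am trying to map those addresses to geolocation information, such as the longitude and latitude columns in ip2Country.

I am running this as a spark-submit job, but the operation takes a very long time, even though df1 has fewer than 2,500 rows.

My code: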

// Look up the geolocation of the source IP.
val agg = df1
  .join(ip2Country, ip2Country("network_start_int") === df1("sint"), "inner")
  .select(
    $"src_ip",
    $"country_name".alias("scountry"),
    $"iso_3".alias("scode"),
    $"longitude".alias("slong"),
    $"latitude".alias("slat"),
    $"dst_ip", $"dint", $"count")
  .filter($"slong".isNotNull)

// Look up the geolocation of the destination IP.
val agg1 = agg
  .join(ip2Country, ip2Country("network_start_int") === agg("dint"), "inner")
  .select(
    $"src_ip", $"scountry", $"scode", $"slong", $"slat",
    $"dst_ip",
    $"country_name".alias("dcountry"),
    $"iso_3".alias("dcode"),
    $"longitude".alias("dlong"),
    $"latitude".alias("dlat"),
    $"count")
  .filter($"dlong".isNotNull)

Is there another way to join these two tables? Or am I doing it wrong?

If you have a large DataFrame that needs to be joined with a small one, a broadcast join is very efficient. See here:

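Which one takes more time, agg or agg1?

Actually, both take a long time. When I print sys.time, agg1 takes longer.

largedf.join(broadcast(smalldf)) will work; broadcast acts as a hint to the framework.

So something like ip2Country.join(broadcast(df1), ...)?

Yes, please see my answer; it explains why this works better. The details are explained well there. Please upvote if you like it. Thanks.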

bigdf.join(broadcast(smalldf))
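Applied to the question, the broadcast hint might look like the sketch below. It is only an illustration under stated assumptions, not the asker's or answerer's actual code: the object BroadcastJoinSketch, the helpers geoLookup and enrich, the local master setting, and the sample rows are all hypothetical, and it assumes network_start_int is unique in ip2Country so that the intermediate result stays small enough to broadcast.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object BroadcastJoinSketch {

  // Hypothetical helper: project the lookup table under a prefix ("s" or "d")
  // so the source and destination lookups never share column names.
  def geoLookup(ip2Country: DataFrame, prefix: String): DataFrame =
    ip2Country.select(
      col("network_start_int").alias(s"${prefix}int"),
      col("country_name").alias(s"${prefix}country"),
      col("iso_3").alias(s"${prefix}code"),
      col("longitude").alias(s"${prefix}long"),
      col("latitude").alias(s"${prefix}lat")
    )

  // Hypothetical helper: enrich df1 with source and destination geolocation.
  def enrich(df1: DataFrame, ip2Country: DataFrame): DataFrame = {
    // Hint that the small side (df1) should be broadcast to every executor,
    // so the large lookup table does not have to be shuffled.
    val withSrc = geoLookup(ip2Country, "s")
      .join(broadcast(df1), Seq("sint"), "inner")

    // Assumption: withSrc is still small, so it can be broadcast as well.
    geoLookup(ip2Country, "d")
      .join(broadcast(withSrc), Seq("dint"), "inner")
      .filter(col("slong").isNotNull && col("dlong").isNotNull)
      .select("src_ip", "scountry", "scode", "slong", "slat",
              "dst_ip", "dcountry", "dcode", "dlong", "dlat", "count")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows that only mimic the column names in the question.
    val df1 = Seq(("1.0.0.0", 16777216L, "8.8.8.0", 134744064L, 3L))
      .toDF("src_ip", "sint", "dst_ip", "dint", "count")
    val ip2Country = Seq(
      (16777216L, "Country A", "AAA", 10.0, 20.0),
      (134744064L, "Country B", "BBB", 30.0, 40.0)
    ).toDF("network_start_int", "country_name", "iso_3", "longitude", "latitude")

    val result = enrich(df1, ip2Country)
    result.explain()   // prints the physical plan, where the join strategy is visible
    result.show(false)
    spark.stop()
  }
}
```

Checking the plan printed by explain() is a quick way to see whether Spark actually chose a broadcast join or fell back to a shuffle-based one. Because both joins are inner joins, either side may be broadcast; the hint is placed on the side expected to be small.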