How to improve the performance of a PySpark join (Python, Apache Spark, PySpark, Apache Spark SQL)


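I have two DataFrames, as shown below:

df1 (20 million rows):

df2 (50 rows):

I want to add a new column "state" to df1 by comparing the lat/long values in df1 and df2. With these DataFrames, an exact join on lat/long returns zero records, so I use a threshold and join on that instead: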

from pyspark.sql import functions as F

threshold = F.lit(3)
def lat_long_approximation(col1, col2, threshold):
    # True where the two coordinates differ by less than the threshold
    return F.abs(col1 - col2) < threshold

df3 = df1.join(F.broadcast(df2),
               lat_long_approximation(df1.lat, df2.lat, threshold) &
               lat_long_approximation(df1.long, df2.long, threshold))

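This takes a very long time. Can anyone help me optimize this join, or suggest a better approach that avoids the separate function (lat_long_approximation)?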

You can use between. I am not sure about the performance, though.

threshold = 10 # for test
df1.join(F.broadcast(df2), 
         df1.lat.between(df2.lat - threshold, df2.lat + threshold) & 
         df1.long.between(df2.long - threshold, df2.long + threshold), "left").show()
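One thing to keep in mind when swapping the two conditions: between includes both bounds, whereas the question's F.abs(col1 - col2) < threshold is strict, so rows sitting exactly on the threshold match in one version and not in the other. The plain-Python sketch below mirrors the two predicates without Spark; the function names and sample values are hypothetical and only there to illustrate the boundary case.

```python
def within_strict(a, b, threshold):
    """Mirrors F.abs(col1 - col2) < threshold: the boundary is excluded."""
    return abs(a - b) < threshold


def within_inclusive(a, b, threshold):
    """Mirrors col1.between(col2 - threshold, col2 + threshold): both bounds included."""
    return b - threshold <= a <= b + threshold


# Hypothetical latitudes chosen so the difference equals the threshold exactly.
threshold = 3
row_lat, state_lat = 35, 32

strict_match = within_strict(row_lat, state_lat, threshold)
inclusive_match = within_inclusive(row_lat, state_lat, threshold)
print(strict_match, inclusive_match)
```

If the strict behaviour matters, it may be simpler to keep the abs-based comparison inline in the join condition rather than switch to between.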