Caching and loops in (Py)Spark

apache-spark, pyspark, while-loop, sequential


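I know that when using Spark, 'for' and 'while' loops should generally be avoided. My question is about optimizing a 'while' loop, though if I'm missing a solution that makes it unnecessary, I'm all ears.

I'm not sure I can demonstrate the issue with toy data (processing time is long, and it compounds as the loop goes on), but here is some pseudocode: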

from pyspark.sql.functions import col

# 'enumerator' is a function that involves several joins and window functions.
# Run it on the base dataset, df0, to get df1.
# (apple, banana and some_condition are placeholders.)
df1 = enumerator(df0, param1=apple, param2=banana)

# Count the rows in df1 that meet some condition
counter = df1.filter(col('X') == some_condition).count()

# While any rows meet the condition, keep re-running the function
while counter > 0:
    print('Starting with counter:', counter)

    # Run the enumerator function on the latest result
    df2 = enumerator(df1, param1=apple, param2=banana)

    # Re-check the condition; the loop continues if any rows still match
    counter = df2.filter(col('X') == some_condition).count()

    df1 = df2

# After the loop, df1 holds the last result (this also covers the case where
# the loop body never ran); more operations and analyses follow downstream
final_df = df1

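An important aspect of the enumerator function is that it 'looks back' over a sequence within a window, so it may need to run several times before all the necessary corrections have been made.

In my heart, I know this is ugly, but the window/rank/sequential analysis within the function is critical. My understanding is that the underlying Spark query plan becomes more and more convoluted as the loop continues. Are there best practices I should adopt in this situation? Should I be caching the dataframes at any point, either before the while loop starts or within the loop itself?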

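You should definitely cache/persist the dataframes; otherwise every iteration of the while loop will recompute everything from scratch, starting at df0. You may also want to unpersist dataframes you are done with, to free up disk/memory space.

Another thing to optimize is to avoid doing a count and use a cheaper operation instead, such as df.take(1). If that returns nothing, then counter == 0.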

from pyspark.sql.functions import col

df1 = enumerator(df0, param1=apple, param2=banana)
df1.cache()

# Check whether at least one row in df1 meets the condition (1 if so, else 0)
counter = len(df1.filter(col('X') == some_condition).take(1))

while counter > 0:
    print('Starting with counter:', counter)

    df2 = enumerator(df1, param1=apple, param2=banana)
    df2.cache()

    counter = len(df2.filter(col('X') == some_condition).take(1))
    df1.unpersist()    # release df1; it is about to be replaced by df2

    df1 = df2

final_df = df1    # df1 is the latest result, even if the loop never ran
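
If this pattern shows up more than once, the loop could be wrapped in a small helper. The following is only a rough sketch: the names iterate_until_clean, step, has_violations and max_iterations are made up for illustration and are not part of Spark. It assumes the objects passed around behave like a pyspark.sql.DataFrame, in that cache() and unpersist() exist and return the dataframe itself. It adds an iteration cap so a condition that never clears cannot loop forever, and it releases the previous dataframe only after the check on the new one has run.

```python
from typing import Callable, TypeVar

DF = TypeVar("DF")  # stands in for pyspark.sql.DataFrame


def iterate_until_clean(
    df: DF,
    step: Callable[[DF], DF],
    has_violations: Callable[[DF], bool],
    max_iterations: int = 20,
) -> DF:
    """Apply `step` repeatedly until `has_violations` returns False.

    Hypothetical helper: `step` would wrap the enumerator call and
    `has_violations` would wrap the take(1) check from the answer above.
    The returned dataframe is left cached; the caller decides when to
    unpersist it. The input `df` itself is cached and later released.
    """
    current = df.cache()
    previous = None
    for _ in range(max_iterations):
        # The check is what triggers a Spark action on `current`.
        dirty = has_violations(current)
        # Release the older dataframe only after that action has run.
        # Caveat: a take(1)-style check may materialise only some partitions,
        # so the rest of `current` can still be recomputed later on.
        if previous is not None:
            previous.unpersist()
        if not dirty:
            return current
        previous = current
        current = step(previous).cache()
    raise RuntimeError(f"condition still present after {max_iterations} iterations")
```

With the names from the question, a call might look like iterate_until_clean(df0, lambda d: enumerator(d, param1=apple, param2=banana), lambda d: len(d.filter(col('X') == some_condition).take(1)) > 0). Note this differs slightly from the code above in that the check also runs on df0 before the first enumerator call.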

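Thank you very much. Using '.take(1)' in these scenarios is something I hadn't considered before, but it seems obvious in hindsight. I had assumed that if I used a condition like 'counter > 0' (stopping as soon as it reaches 1), the Spark query planner might do the same thing under the hood.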
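
On that last point: count() is an action that hands a plain Python integer back to the driver, and the comparison with 0 happens afterwards in ordinary Python, so the planner never sees the '> 0' and cannot stop early because of it. Here is a small sketch of the alternatives side by side. The function names are made up for illustration, and df is assumed to be a pyspark.sql.DataFrame that has already been filtered on the condition.

```python
def has_rows_by_count(df) -> bool:
    # Counts every matching row before the comparison is made in Python.
    return df.count() > 0


def has_rows_by_take(df) -> bool:
    # Asks for at most one row, so Spark can stop once it has found one.
    return len(df.take(1)) > 0


def has_rows_by_limit(df) -> bool:
    # Keeps the check inside the query plan: count at most one row.
    return df.limit(1).count() > 0
```

All three return the same answer; the difference is how much work Spark does to get there. Newer PySpark releases also offer DataFrame.isEmpty() (from 3.3 onward, as far as I recall), which reads more directly if it is available in your version.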