Apache Spark: Caching and loops in (Py)Spark


I know that "for" and "while" loops are generally to be avoided when working with Spark. My question is how to optimize a "while" loop, though if I have missed a solution that makes it unnecessary altogether, I am all ears.

I am not sure I can demonstrate the problem with toy data (the processing times are long and grow as the loop progresses), but here is some pseudocode:

### I have a function - called 'enumerator' - which involves several joins and window functions.
from pyspark.sql.functions import col

# I run this function on my base dataset, df0, and return df1
df1 = enumerator(df0, param1 = apple, param2 = banana)

# Check for some condition in df1, then count number of rows in the result
counter = df1 \
.filter(col('X') == some_condition) \
.count()

# If there are rows meeting this condition, start a while loop
while counter > 0:
  print('Starting with counter: ', str(counter))
  
  # Run the enumerator function on df1 again
  df2 = enumerator(df1, param1 = apple, param2 = banana)
  
  # Check for the condition again, then continue the while loop if necessary
  counter = df2 \
  .filter(col('X') == some_condition) \
  .count()
  
  df1 = df2

# After the while loop finishes, I take the last resulting dataframe and I will do several more operations and analyses downstream  
final_df = df2
An important aspect of the enumerator function is that it "looks back" at sequences within a window, so it may need to run several times before all the necessary corrections have been made.
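
Purely to make the "look back" idea concrete, here is a minimal, hypothetical sketch of what such a step could look like; the column names (group_id, ts, value), the flagging rule, and the way param1/param2 are used are all assumptions, not the actual logic of my function:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def enumerator(df, param1, param2):
    # Hypothetical: look back one row within each group, ordered by time
    w = Window.partitionBy('group_id').orderBy('ts')
    return (df
            .withColumn('prev_value', F.lag('value').over(w))
            # Flag rows that disagree with their predecessor; the outer while
            # loop re-runs this until no flagged rows remain
            .withColumn('X', F.when(F.col('value') < F.col('prev_value'), param1)
                              .otherwise(param2)))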


In my gut I know this is ugly, but the window/ranking/sequential analysis inside the function is essential. My understanding is that the underlying Spark query plan gets more and more complex as the loop goes on. Are there best practices I should be following here? Should I be caching the dataframes at any point, either before the while loop starts or inside the loop itself?

You should definitely cache/persist the dataframes, otherwise every iteration in the while loop will start from scratch at df0. You may also want to unpersist dataframes you are done with, to free up disk/memory space.

Another point to optimize is to not do a count, but to use a cheaper operation such as df.take(1). If that returns nothing, then counter == 0.

df1 = enumerator(df0, param1 = apple, param2 = banana)
df1.cache()    # cache df1 so each loop iteration does not recompute it from df0

# Check for some condition in df1, then count number of rows in the result
counter = len(df1.filter(col('X') == some_condition).take(1))

while counter > 0:
  print('Starting with counter: ', str(counter))
  
  df2 = enumerator(df1, param1 = apple, param2 = banana)
  df2.cache()    # cache the new result before the old one is unpersisted

  counter = len(df2.filter(col('X') == some_condition).take(1))
  df1.unpersist()    # unpersist df1 as it will be overwritten
  
  df1 = df2

final_df = df2
Thank you very much. Using ".take(1)" in these scenarios is something I hadn't considered before, but it seems obvious in hindsight. I had assumed that with a condition like "counter > 0" (stopping as soon as it reaches 1), the Spark query planner might accomplish the same thing under the hood.
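
For the record, count() is an action that aggregates over every partition before the result ever reaches the driver, so the "counter > 0" comparison in Python cannot be pushed down and short-circuited; take(1) lets Spark stop as soon as a single matching row is found. A rough side-by-side of the two existence checks on df1:

# Full count: scans every partition even though only "is it > 0" matters
has_rows = df1.filter(col('X') == some_condition).count() > 0

# Early-exit check: Spark can stop as soon as one matching row is found
has_rows = len(df1.filter(col('X') == some_condition).take(1)) > 0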