
Apache Spark / PySpark skips some stages, but these stages should process new data and should not be skipped


My core PySpark code is inside a for loop:

# Assumed context: sc is an existing SparkContext, train_rdd is the training
# data RDD, and fit() runs one pass over a partition using the broadcast v.
import logging
import numpy as np

global_v = np.random.rand(feature_num, vec_dim)
global_v_bc = sc.broadcast(global_v)
for i in range(epoch):
    s1 = train_rdd.repartition(500)\
            .mapPartitions(lambda x: fit(x, global_v_bc))\
            .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1], x[2] + y[2]])\
            .map(lambda x: x[1])
    s2 = s1.take(1)[0]
    s1.unpersist()
    logging.info("epoch {} train-loss:{}".format(i, s2[0] / s2[1]))
    global_v_bc.destroy()
    global_v_bc = sc.broadcast(s2[2] / s2[1])
This is a simple FM (factorization machine) pattern: it updates v in every partition and re-broadcasts the result for the next iteration, yet at runtime many stages are skipped.
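To make the update step concrete, here is a minimal sketch, assuming (since `fit()` is not shown in the question) that each partition emits `(key, [loss_sum, sample_count, v_sum])` triples. Under that assumption, the elementwise `reduceByKey` lambda in the loop yields the per-epoch average loss and the averaged v that gets re-broadcast:

```python
# Hypothetical partition outputs (NOT the author's actual fit()): each is
# assumed to be [loss_sum, sample_count, v_sum] for illustration only.
import numpy as np

def combine(x, y):
    # Same elementwise merge as the reduceByKey lambda in the loop above.
    return [x[0] + y[0], x[1] + y[1], x[2] + y[2]]

p1 = [4.0, 2, np.array([2.0, 4.0])]   # loss_sum, count, summed v
p2 = [2.0, 1, np.array([1.0, 2.0])]

s2 = combine(p1, p2)
avg_loss = s2[0] / s2[1]        # what the logging line reports
new_global_v = s2[2] / s2[1]    # what gets re-broadcast for the next epoch
```

So `s2[0]/s2[1]` is the mean training loss and `s2[2]/s2[1]` is the new v, which matches the `logging.info` and `sc.broadcast` lines in the question.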

Can anyone help me?

Here is the web UI content (screenshot not available):

Comments:

Does this answer your question?
@cronoik, I have repartitioned the data, updated the broadcast value, and unpersisted the cached RDD, so I don't think that stage should be skipped.
Have you debugged to check whether your global variable is actually being updated? Which tasks are skipped?
It may be that Spark is only skipping the initial computation of s1 from train_rdd: building train_rdd only has to happen once, and the stage is then skipped in every later epoch because its input is constant...