Appending a map function to a PySpark RDD inside a for loop

Can someone help me understand what happens when map functions are appended to an RDD inside a Python for loop?

For the following code:

rdd = spark.sparkContext.parallelize([[1], [2], [3]])

def appender(l, i):
    return l + [i]

for i in range(3):
    rdd = rdd.map(lambda x: appender(x, i))

rdd.collect()

I get the output:

[[1, 2, 2, 2], [2, 2, 2, 2], [3, 2, 2, 2]]

Whereas with the following code:

rdd = spark.sparkContext.parallelize([[1], [2], [3]])

def appender(l, i):
    return l + [i]

rdd = rdd.map(lambda x: appender(x, 1))
rdd = rdd.map(lambda x: appender(x, 2))
rdd = rdd.map(lambda x: appender(x, 3))

rdd.collect()

I get the expected output:

[[1, 1, 2, 3], [2, 1, 2, 3], [3, 1, 2, 3]]

I suspect this has something to do with the closure that is passed to the PySpark compiler, but I can't find any documentation about it…

My best guess is that this is caused by lazy evaluation. Your range is also off.
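The lazy-evaluation guess can be illustrated without Spark. The following is only a plain-Python analogy under that assumption: `pending` and `run` are hypothetical stand-ins for the RDD's queued transformations and for `collect()`, not PySpark APIs. Because the lambdas look up `i` only when they are finally called, all of them see the last value of the loop variable.

```python
def appender(l, i):
    return l + [i]

# Hypothetical stand-in for the lazy RDD lineage: functions are queued, not run.
pending = []
for i in range(3):
    pending.append(lambda x: appender(x, i))  # i is looked up when the lambda is called

def run(rows, funcs):
    # Hypothetical stand-in for collect(): apply every queued function to every row.
    out = []
    for row in rows:
        for f in funcs:
            row = f(row)
        out.append(row)
    return out

# By now the loop has finished and i == 2, so every lambda appends 2.
late_bound = run([[1], [2], [3]], pending)
print(late_bound)  # [[1, 2, 2, 2], [2, 2, 2, 2], [3, 2, 2, 2]]
```

This reproduces the output reported in the question, which suggests the behaviour comes from Python's late-binding closures combined with deferred execution rather than from anything specific to `map`.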

These two snippets produce the same output:

rdd = spark.sparkContext.parallelize([[1], [2], [3]])

def appender(l, i):
    return l + [i]

for i in range(1,4):
    rdd = spark.sparkContext.parallelize(rdd.map(lambda x: appender(x, i)).collect())

rdd.collect()

Output:

[[1, 1, 2, 3], [2, 1, 2, 3], [3, 1, 2, 3]]

The second one:

rdd = spark.sparkContext.parallelize([[1], [2], [3]])

rdd = rdd.map(lambda x: appender(x, 1))
rdd = rdd.map(lambda x: appender(x, 2))
rdd = rdd.map(lambda x: appender(x, 3))

rdd.collect()

Output:

[[1, 1, 2, 3], [2, 1, 2, 3], [3, 1, 2, 3]]

Also, to show what happens in the for loop on a simplified example (inputs 1 and 2 only), use a modified appender function that prints its l argument:
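The modified function itself is not shown here. A minimal sketch of what it might look like, assuming it simply prints `l` before appending (this is a reconstruction for illustration, not the answerer's original code):

```python
def appender(l, i):
    # Hypothetical reconstruction: show the list each map call receives,
    # then append exactly as the original appender does.
    print(l)
    return l + [i]
```

In local mode the printed lists appear on the driver console; the order in which rows are printed depends on how the data is partitioned.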

The for loop prints:

[2]
[2, 2]
[1]
[3]
[1, 2]
[3, 2]

That is, the first map already appends the second value from the input list.

With the mappers written out explicitly, the output is:

[1]
[1, 1]
[2]
[2, 1]
[3]
[3, 1]

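The solution is to store all global variables (in this case `i`) in the lambda function itself, so that the closure captures their current value. This can be done as follows: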

for i in range(3):
    rdd = rdd.map(lambda x, i=i: appender(x, i))
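Why the default argument helps can be checked in plain Python, without Spark. In the sketch below, `apply_all`, `by_default` and `by_partial` are hypothetical names used only for illustration. The `functools.partial` variant is an alternative that is not part of the answer above; it binds the value in the same eager way.

```python
from functools import partial

def appender(l, i):
    return l + [i]

def apply_all(row, funcs):
    # Apply each function in order, like a chain of map calls on one row.
    for f in funcs:
        row = f(row)
    return row

by_default, by_partial = [], []
for i in range(3):
    # Default arguments are evaluated when the lambda is defined,
    # so each lambda keeps its own copy of i.
    by_default.append(lambda x, i=i: appender(x, i))
    # partial also stores the current value of i at creation time.
    by_partial.append(partial(appender, i=i))

print(apply_all([1], by_default))  # [1, 0, 1, 2]
print(apply_all([1], by_partial))  # [1, 0, 1, 2]
```

Either form should be usable inside `rdd.map(...)`, since both are ordinary callables that take a single positional argument.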

More information on this can be found at

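Interestingly, at least on a local cluster (not yet tested on a distributed cluster), the problem can also be solved by persisting the intermediate RDD: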

for i in range(3):
    rdd = rdd.map(lambda x: appender(x, i))
    rdd.persist()

Both solutions produce

[[1, 0, 1, 2], [2, 0, 1, 2], [3, 0, 1, 2]]

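Well, from my (and my Python interpreter's) point of view there is nothing wrong with the range. I certainly don't want to parallelize the result of rdd.map either; parallelize should be used to distribute an existing collection across the cluster. Keep in mind that this is just test pseudocode. `Python 2.7.12 >>> for i in range(3): ... print(i) 0 1 2`

And in your second snippet you entered the numbers 1, 2, 3.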