Python: saving the intermediate states of an Apache Spark pipeline

python apache-spark pyspark

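I have a fairly complex Apache PySpark pipeline that performs several transformations on a (very large) set of text files. The intended outputs of my pipeline are the results of several of its stages. What is the best way to do this, meaning the most efficient but also the most idiomatic, in the sense of fitting the Spark programming model and style?

Right now, my code looks like this: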

# initialize the pipeline and perform the first set of transformations.
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)

# first checkpoint: the `first_serialization` function serializes
# the data into properly formatted string. 
rdd.map(first_serialization).saveAsTextFile("ckpt1")

# here, I have to read again from the previously saved checkpoint
# using a `first_deserialization` function that deserializes what has
# been serialized by the `first_serialization` function. Then performs
# other transformations.
rdd = ctx.textFile("ckpt1").map(...).map(...)

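And so on. I would like to get rid of the serialization methods and of the repeated save/read steps. By the way, do they hurt efficiency? I assume they do.

Any hints?

Thanks in advance.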

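This may seem obviously simple, because it is, but I would recommend writing out each intermediate stage while continuing to reuse the existing RDD (sidebar: use Datasets/DataFrames instead of RDDs for better performance), so that processing carries on as the intermediate results are written.

There is no need to pay the penalty of reading from disk or the network when you already have the data processed (ideally cached!) and ready for further use.

An example using your own code: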

# initialize the pipeline and perform the first set of transformations.
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)

# first checkpoint: the `first_serialization` function serializes
# the data into properly formatted string. 
string_rdd = rdd.map(first_serialization)
string_rdd.saveAsTextFile("ckpt1")

# reuse the existing RDD after writing out the intermediate results.
# `rdd` here is the same variable we used to create the `string_rdd`
# results above. alternatively, you may want to use the `string_rdd`
# variable here instead of the original `rdd` variable.
rdd = rdd.map(...).map(...)
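
To make the "ideally cached" point concrete, here is a minimal sketch of the same idea with the elided steps filled in by placeholders. The helpers `parse_line`, `first_serialization` and `count_tokens`, and the paths `input.txt`, `ckpt1` and `final`, are all hypothetical stand-ins for the real pipeline steps; only `textFile`, `map`, `cache`, `saveAsTextFile` and `unpersist` are actual RDD calls.

```python
import json


def parse_line(line):
    """Hypothetical first transformation: split a line into tokens."""
    return line.strip().split()


def first_serialization(tokens):
    """Hypothetical serializer: one JSON array per output line."""
    return json.dumps(tokens)


def count_tokens(tokens):
    """Hypothetical second-stage transformation."""
    return len(tokens)


def run_pipeline(ctx, input_path, ckpt_path, output_path):
    # First set of transformations. cache() asks Spark to keep the result
    # in memory so it is computed once and shared by both branches below.
    rdd = ctx.textFile(input_path).map(parse_line).cache()

    # Write the intermediate stage. This action also populates the cache.
    rdd.map(first_serialization).saveAsTextFile(ckpt_path)

    # Continue from the cached RDD instead of re-reading ckpt_path,
    # so no deserialization function is needed.
    rdd.map(count_tokens).map(str).saveAsTextFile(output_path)

    # Release the cached partitions once both outputs are written.
    rdd.unpersist()


def main():
    # Imported here so the helpers above can be used without Spark installed.
    import pyspark

    ctx = pyspark.SparkContext("local", "MyPipeline")
    try:
        run_pipeline(ctx, "input.txt", "ckpt1", "final")
    finally:
        ctx.stop()
```

Calling `main()` runs the sketch locally. Note that `saveAsTextFile` fails if the target directory already exists, so `ckpt1` and `final` must be removed between runs. Without the `cache()` call the pipeline still works, but Spark recomputes the first transformations for each save.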

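Could you please improve your answer, for example by adding a link to an example and/or some reference code? Thanks.

@petrux, I have provided the example above using your own code. I strongly recommend evaluating the Spark 2.x data structures (2.2 is the latest at the time of writing), such as Dataset and DataFrame. In Python only the pyspark.sql DataFrame exists; Dataset is not available the way it is in Scala.

@Garrenn: thanks a lot. So I just need to save as a text file. Good. As for the Spark version, I am using 2.2, but I don't know whether DataFrames suit my task. Anyway, I will have a look. Thanks for the suggestion.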
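
For reference, the DataFrame suggestion from the comments could look roughly like the sketch below. It is only an illustration under assumptions: the whitespace tokenization, the column names `tokens` and `n_tokens`, and the function name `run_dataframe_pipeline` are hypothetical, and whether DataFrames fit the real task depends on the transformations involved. Writing Parquet stores the schema with the data, so no hand-written serialization or deserialization function is needed for the intermediate stage.

```python
def run_dataframe_pipeline(spark, input_path, ckpt_path, output_path):
    # Imported here so this module can be loaded without Spark installed.
    from pyspark.sql import functions as F

    # spark.read.text yields a DataFrame with one string column, "value".
    lines = spark.read.text(input_path)

    # Hypothetical first stage: split each line on whitespace.
    tokens = lines.select(F.split(F.col("value"), r"\s+").alias("tokens")).cache()

    # Save the intermediate stage; the Parquet file carries its own schema.
    tokens.write.parquet(ckpt_path)

    # Hypothetical second stage, continuing from the cached DataFrame.
    counts = tokens.select(F.size("tokens").alias("n_tokens"))
    counts.write.parquet(output_path)

    tokens.unpersist()


def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local").appName("MyPipeline").getOrCreate()
    try:
        run_dataframe_pipeline(spark, "input.txt", "ckpt1_parquet", "final_parquet")
    finally:
        spark.stop()
```

If a later job does need the intermediate stage, `spark.read.parquet` on the checkpoint path returns it with the same columns and types.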