Python: Importing a custom accumulator type in Spark


python apache-spark import pyspark


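I am trying to use a custom accumulator class. It works if I define the class locally, but when I define it in a separate module and ship the file with sc.addPyFile, I get an ImportError.

I ran into the same problem when importing a helper function used inside rdd.foreach, and worked around it by doing the import inside the function passed to foreach (see the example below). The same fix does not work for the custom accumulator, though, and I would not really expect it to.

tl;dr: What is the correct way to import a custom accumulator class?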

extensions/accumulators.py:

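import numpy
import pyspark


class ArrayAccumulatorParam(pyspark.AccumulatorParam):
    def zero(self, initialValue):
        # Start from an array of zeros with the same shape as the initial value.
        return numpy.zeros(initialValue.shape)

    def addInPlace(self, a, b):
        # Element-wise sum, accumulated into the first argument.
        a += b
        return a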

run/count.py:

import numpy
import pyspark

from extensions.accumulators import ArrayAccumulatorParam

def main(sc):
    sc.addPyFile(LIBRARY_PATH + '/import_/logs.py')
    sc.addPyFile(LIBRARY_PATH + '/extensions/accumulators.py')

    rdd = sc.textFile(LOGS_PATH)
    accum = sc.accumulator(numpy.zeros(DIMENSIONS), ArrayAccumulatorParam())

    def count(row):
        import logs # This 'internal import' seems to be required to avoid ImportError for the 'logs' module
        from extensions.accumulators import ArrayAccumulatorParam # Error is thrown both with and without this line

        val = logs.parse(row)
        accum.add(val)

    rdd.foreach(count) # Throws ImportError: No module named extensions.accumulators

if __name__ == '__main__':
    conf = pyspark.SparkConf().setAppName('SOME_COUNT_JOB')
    sc = pyspark.SparkContext(conf=conf)
    main(sc)

Error:

ImportError: No module named extensions.accumulators
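A possible cause, which is not confirmed anywhere in this thread, is that shipping a single .py file makes it importable on the workers only under its bare file name, so the extensions package prefix is lost. sc.addPyFile also accepts a .zip archive, and a zip can keep the package layout. The sketch below uses only the standard library (no Spark) to show that a zip containing extensions/accumulators.py is importable as extensions.accumulators. The helper names build_package_zip and import_from_zip, the MARKER variable, and the generated file contents are all hypothetical and exist only for illustration.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_package_zip(workdir):
    """Create a tiny 'extensions' package under workdir and zip it with its package path."""
    pkg_dir = os.path.join(workdir, "extensions")
    os.makedirs(pkg_dir)
    # An empty __init__.py makes 'extensions' a regular package.
    open(os.path.join(pkg_dir, "__init__.py"), "w").close()
    with open(os.path.join(pkg_dir, "accumulators.py"), "w") as f:
        f.write("MARKER = 'loaded from zip'\n")  # placeholder module body

    zip_path = os.path.join(workdir, "extensions.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        for name in ("__init__.py", "accumulators.py"):
            # arcname keeps the 'extensions/' prefix inside the archive.
            zf.write(os.path.join(pkg_dir, name), arcname="extensions/" + name)
    return zip_path


def import_from_zip(zip_path):
    """Import extensions.accumulators with the zip archive placed on sys.path."""
    sys.path.insert(0, zip_path)
    try:
        importlib.invalidate_caches()
        return importlib.import_module("extensions.accumulators")
    finally:
        sys.path.remove(zip_path)


workdir = tempfile.mkdtemp()
zip_path = build_package_zip(workdir)
module = import_from_zip(zip_path)
print(module.MARKER)
```

In a Spark job, the analogous step would presumably be sc.addPyFile(zip_path) before the accumulator is created. Whether that resolves the ImportError in this exact setup is an assumption, not something tested here.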

For a custom accumulator, use pyspark.accumulators.AccumulatorParam.
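As a minimal sketch of what that might look like, the class below subclasses AccumulatorParam from pyspark.accumulators and sums equal-length lists element by element. The class name VectorAccumulatorParam is hypothetical, plain lists are used instead of numpy arrays to keep the example dependency-free, and the fallback base class is only a stand-in so the snippet can be run on a machine without Spark installed.

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    # Stand-in used only when Spark is not installed (assumption for illustration).
    class AccumulatorParam(object):
        def zero(self, value):
            raise NotImplementedError

        def addInPlace(self, value1, value2):
            raise NotImplementedError


class VectorAccumulatorParam(AccumulatorParam):
    """Element-wise sum of equal-length lists of numbers (hypothetical example)."""

    def zero(self, initial_value):
        # A zero vector with the same length as the initial value.
        return [0.0] * len(initial_value)

    def addInPlace(self, a, b):
        # Add b into a element by element and return the updated a.
        for i, x in enumerate(b):
            a[i] += x
        return a


# Exercise the two methods directly, without a SparkContext.
param = VectorAccumulatorParam()
total = param.zero([1.0, 2.0, 3.0])
total = param.addInPlace(total, [1.0, 2.0, 3.0])
total = param.addInPlace(total, [0.5, 0.5, 0.5])
print(total)
```

With a live SparkContext sc, such a parameter object would be passed as the second argument, for example sc.accumulator([0.0, 0.0, 0.0], VectorAccumulatorParam()). The module that defines the class still has to be importable on the workers under the same module name.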