Python: How do I make Spark see code in a different module?

I have a complicated function that I run over a dataset in Spark using map. The function lives in a different Python module. When map is invoked, the executor nodes do not have that code, and the map function then fails:
s_cobDates = getCobDates()  # returns a list of dates
sb_dataset = sc.broadcast(dataset)  # fyi - it is not trivial to slice this into chunks per date

def sparkInnerLoop(n_cobDate):
    n_dataset = sb_dataset.value
    import someOtherModule
    return someOtherModule.myComplicatedCalc(n_dataset)

results = sc.parallelize(s_cobDates).map(sparkInnerLoop).collect()
Spark then fails because it cannot import someOtherModule on the executors.
So far I have worked around this by creating a Python package containing the other modules and deploying it to the cluster before the Spark job, but that does not lend itself to rapid prototyping.

How do I get Spark to ship the complete code to the executor nodes, without inlining it all into sparkInnerLoop? That code is used elsewhere in my solution and I don't want to duplicate it.
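For quick prototyping with a single flat module, Spark's `sc.addPyFile` can ship one `.py` file to every executor; the shipped file's location effectively ends up on each executor's `sys.path` before tasks run. Below is a minimal local sketch of that mechanism only. The module name `someothermodule` and its contents are hypothetical, and `sys.path` is manipulated directly here to stand in for what `addPyFile` does on each worker:

```python
import importlib
import os
import sys
import tempfile

# Write a stand-in for the "other module" into a directory that is
# not yet importable -- analogous to code that only exists on the driver.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "someothermodule.py"), "w") as f:
    f.write("def myComplicatedCalc(x):\n    return x * 2\n")

# sc.addPyFile effectively arranges this on every executor: the shipped
# file's directory becomes importable before tasks run.
sys.path.insert(0, tmpdir)

someothermodule = importlib.import_module("someothermodule")
print(someothermodule.myComplicatedCalc(21))  # -> 42
```

This works for standalone modules, but as the answer below notes, it falls down when the code lives inside a package, because the package structure must survive the transfer.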
I am running an 8-node cluster in standalone mode, version 1.6.2, with the driver running on my PyCharm workstation.

The answer above works, but breaks if your modules are part of a package. Instead, you can zip the modules first and then add the zip file to the Spark context, so that they keep their correct package names:
import os
import pathlib
import zipfile

def ziplib():
    libpath = os.path.dirname(__file__)  # this should point to your packages directory
    zippath = r'c:\Temp\mylib-' + randstr.randstr(6) + '.zip'  # randstr: helper returning a random string suffix
    zippath = os.path.abspath(zippath)
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3  # making it verbose, good for debugging
        zf.writepy(libpath)
        return zippath  # return path to generated zip archive
    finally:
        zf.close()

sc = SparkContext(conf=conf)
zip_path = ziplib()  # generate zip archive containing your lib
zip_path = pathlib.Path(zip_path).as_uri()
sc.addPyFile(zip_path)  # add the entire archive to SparkContext
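The key detail is that `PyZipFile.writepy` compiles a package directory and stores it in the archive under the package's own name, so imports from the zip resolve exactly as they would from the filesystem. Here is a self-contained sketch of that behavior; the package name `demo_pkg_zl` is hypothetical, and adding the zip to `sys.path` stands in for the executor-side effect of `sc.addPyFile`:

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny package on disk (stands in for your library).
tmp = tempfile.mkdtemp()
pkgdir = os.path.join(tmp, "demo_pkg_zl")
os.makedirs(pkgdir)
with open(os.path.join(pkgdir, "__init__.py"), "w") as f:
    f.write("def myComplicatedCalc(x):\n    return x + 1\n")

# writepy() compiles the package and stores it under its package name
# (e.g. demo_pkg_zl/__init__.pyc), so the package structure survives.
zippath = os.path.join(tmp, "mylib.zip")
zf = zipfile.PyZipFile(zippath, mode="w")
try:
    zf.writepy(pkgdir)
finally:
    zf.close()

# Python can import straight out of the archive once it is on sys.path,
# which is what sc.addPyFile(zip_path) arranges on each executor.
sys.path.insert(0, zippath)
pkg = importlib.import_module("demo_pkg_zl")
print(pkg.myComplicatedCalc(41))  # -> 42
```

Because the archive preserves the `demo_pkg_zl/` prefix, `import demo_pkg_zl` on an executor resolves with the correct package name, which is exactly what adding individual files fails to do.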