Python: How do I get Spark to see code that lives in a different module?


I have a complicated function that I run over a dataset in Spark using the map function. It lives in a different Python module. When map is called, the executor nodes do not have that code, and the map function fails.

s_cobDates = getCobDates() #returns a list of dates
sb_dataset = sc.broadcast(dataset) #fyi - it is not trivial to slice this into chunks per date

def sparkInnerLoop(n_cobDate):
    n_dataset = sb_dataset.value
    import someOtherModule
    return someOtherModule.myComplicatedCalc(n_dataset)

# parallelize the dates so the calculation runs on the executors
results = sc.parallelize(s_cobDates).map(sparkInnerLoop).collect()
Spark then fails because it cannot import someOtherModule on the executors.

So far I have worked around this by creating a Python package containing the other modules and deploying it to the cluster ahead of the Spark job, but that is not conducive to rapid prototyping.
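
For reference, that deploy step can also be done programmatically rather than by copying the package to every node by hand: the archive can be handed to the SparkContext when it is created. A minimal sketch, assuming the modules have already been zipped into mylib.zip (a placeholder path):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("prototyping")
# everything listed in pyFiles is shipped to the executors and added to
# their PYTHONPATH before any tasks run
sc = SparkContext(conf=conf, pyFiles=['/path/to/mylib.zip'])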

How can I get Spark to ship the complete code to the executor nodes without inlining it all into sparkInnerLoop? That code is used elsewhere in my solution and I do not want to duplicate it.


I am running an 8-node cluster in standalone mode, Spark 1.6.2, with the driver running on my workstation in PyCharm.
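
The answer referred to just below is not captured in this page; it presumably ships individual module files with SparkContext.addPyFile. A minimal sketch of that simpler approach, assuming someOtherModule.py sits next to the driver script:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("prototyping")
sc = SparkContext(conf=conf)

# copy the single module file out to every executor; after this,
# "import someOtherModule" resolves inside functions passed to map()
sc.addPyFile('someOtherModule.py')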

The answer above works, but it falls down if your modules are part of a package. Instead, zip the modules up first and add the zip file to the Spark context; that way they keep their correct package names.

import os
import pathlib
import uuid
import zipfile

from pyspark import SparkContext

def ziplib():
    libpath = os.path.dirname(__file__)  # this should point to your packages directory
    # random suffix so repeated runs do not clash (the original used a custom
    # randstr helper; uuid from the standard library does the same job)
    zippath = r'c:\Temp\mylib-' + uuid.uuid4().hex[:6] + '.zip'
    zippath = os.path.abspath(zippath)
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3  # making it verbose, good for debugging
        zf.writepy(libpath)  # add the packages found under libpath, keeping package names
        return zippath  # return path to generated zip archive
    finally:
        zf.close()

sc = SparkContext(conf=conf)  # conf is the SparkConf set up elsewhere for the job

zip_path = ziplib()  # generate zip archive containing your lib
zip_path = pathlib.Path(zip_path).as_uri()  # hand addPyFile a file:// URI
sc.addPyFile(zip_path)  # add the entire archive to SparkContext
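
Once the archive is on the Spark context, the executors can import the shipped modules by their full package names inside map functions. A small usage sketch tying this back to the question, where mypackage is a hypothetical package inside the generated zip:

def sparkInnerLoop(n_cobDate):
    # resolves on the executor because the zip added via addPyFile is
    # placed on every worker's Python path
    from mypackage import someOtherModule
    return someOtherModule.myComplicatedCalc(sb_dataset.value)

results = sc.parallelize(s_cobDates).map(sparkInnerLoop).collect()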
