Python: How do I make Spark see code in a different module?

I have a complicated function that I run over a dataset in Spark using map. The function lives in a different Python module. When map is invoked, the executor nodes do not have that code, and the map function then fails:
s_cobDates = getCobDates()  # returns a list of dates
sb_dataset = sc.broadcast(dataset)  # fyi - it is not trivial to slice this into chunks per date

def sparkInnerLoop(n_cobDate):
    n_dataset = sb_dataset.value
    import someOtherModule
    return someOtherModule.myComplicatedCalc(n_dataset)

results = sc.parallelize(s_cobDates).map(sparkInnerLoop).collect()
Spark then fails because it cannot import someOtherModule on the executors.
So far I have worked around this by creating a Python package containing the other modules and deploying it to the cluster before the Spark job, but that does not lend itself to rapid prototyping.

How do I get Spark to ship the complete code to the executor nodes, without inlining it all into sparkInnerLoop? That code is used elsewhere in my solution and I don't want to duplicate it.
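For quick prototyping with a single flat module, Spark's `sc.addPyFile` can ship one `.py` file to every executor; the shipped file's location effectively ends up on each executor's `sys.path` before tasks run. Below is a minimal local sketch of that mechanism only. The module name `someothermodule` and its contents are hypothetical, and `sys.path` is manipulated directly here to stand in for what `addPyFile` does on each worker:

```python
import importlib
import os
import sys
import tempfile

# Write a stand-in for the "other module" into a directory that is
# not yet importable -- analogous to code that only exists on the driver.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "someothermodule.py"), "w") as f:
    f.write("def myComplicatedCalc(x):\n    return x * 2\n")

# sc.addPyFile effectively arranges this on every executor: the shipped
# file's directory becomes importable before tasks run.
sys.path.insert(0, tmpdir)

someothermodule = importlib.import_module("someothermodule")
print(someothermodule.myComplicatedCalc(21))  # -> 42
```

This works for standalone modules, but as the answer below notes, it falls down when the code lives inside a package, because the package structure must survive the transfer.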
I am running an 8-node cluster in standalone mode, version 1.6.2, with the driver running on my PyCharm workstation.

The answer above works, but breaks if your modules are part of a package. Instead, you can zip the modules first and then add the zip file to the Spark context, so that they keep their correct package names:
import os
import pathlib
import zipfile

def ziplib():
    libpath = os.path.dirname(__file__)  # this should point to your packages directory
    zippath = r'c:\Temp\mylib-' + randstr.randstr(6) + '.zip'  # randstr: helper returning a random string suffix
    zippath = os.path.abspath(zippath)
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3  # making it verbose, good for debugging
        zf.writepy(libpath)
        return zippath  # return path to generated zip archive
    finally:
        zf.close()

sc = SparkContext(conf=conf)
zip_path = ziplib()  # generate zip archive containing your lib
zip_path = pathlib.Path(zip_path).as_uri()
sc.addPyFile(zip_path)  # add the entire archive to SparkContext
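The key detail is that `PyZipFile.writepy` compiles a package directory and stores it in the archive under the package's own name, so imports from the zip resolve exactly as they would from the filesystem. Here is a self-contained sketch of that behavior; the package name `demo_pkg_zl` is hypothetical, and adding the zip to `sys.path` stands in for the executor-side effect of `sc.addPyFile`:

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny package on disk (stands in for your library).
tmp = tempfile.mkdtemp()
pkgdir = os.path.join(tmp, "demo_pkg_zl")
os.makedirs(pkgdir)
with open(os.path.join(pkgdir, "__init__.py"), "w") as f:
    f.write("def myComplicatedCalc(x):\n    return x + 1\n")

# writepy() compiles the package and stores it under its package name
# (e.g. demo_pkg_zl/__init__.pyc), so the package structure survives.
zippath = os.path.join(tmp, "mylib.zip")
zf = zipfile.PyZipFile(zippath, mode="w")
try:
    zf.writepy(pkgdir)
finally:
    zf.close()

# Python can import straight out of the archive once it is on sys.path,
# which is what sc.addPyFile(zip_path) arranges on each executor.
sys.path.insert(0, zippath)
pkg = importlib.import_module("demo_pkg_zl")
print(pkg.myComplicatedCalc(41))  # -> 42
```

Because the archive preserves the `demo_pkg_zl/` prefix, `import demo_pkg_zl` on an executor resolves with the correct package name, which is exactly what adding individual files fails to do.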