Apache Spark: how to pass deep learning model data to a map function in Spark


I have a very simple use case: I read a large number of images from S3 as an RDD using the sc.binaryFiles method. Once this RDD is created, I pass its contents to a VGG16 feature extractor function. The feature extraction needs the model data, so I put the model data into a broadcast variable and fetch its value inside each map function. The code is below:-

s3_files_rdd = sc.binaryFiles(RESOLVED_IMAGE_PATH)

s3_files_rdd.persist()

model_data = initVGG16()
broadcast_model = sc.broadcast(model_data)

features_rdd = s3_files_rdd.mapPartitions(extract_features_)

response_rdd = features_rdd.map(lambda x: (x[0], write_to_s3(x, OUTPUT, FORMAT_NAME)))
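
initVGG16 is not shown in the post; a plausible sketch of such a loader, assuming the features are taken from VGG16's fc1 layer (an assumption, not something stated above), is:

from keras.applications.vgg16 import VGG16
from keras.models import Model

def initVGG16():
    # Hypothetical loader: pretrained VGG16, exposing the 4096-dimensional
    # fc1 activations as the feature vector
    base_model = VGG16(weights='imagenet')
    return Model(inputs=base_model.input, outputs=base_model.get_layer('fc1').output)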
extract_features_ method:-

def extract_features_(xs):
    model_data = initVGG16()
    for k, v in xs:
        yield k, extract_features(model_data, v)
extract_features method:-

from keras.preprocessing import image
from keras.applications.vgg16 import VGG16
from keras.models import Model
from io import BytesIO
from keras.applications.vgg16 import preprocess_input
import numpy as np
def extract_features(model,obj):
    try:
        print('executing vgg16 feature extractor...')
        img = image.load_img(BytesIO(obj), target_size=(224, 224,3))
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)
        img_data = preprocess_input(img_data)
        vgg16_feature = model.predict(img_data)[0]
        print('++++++++++++++++++++++++++++',vgg16_feature.shape)
        return vgg16_feature
    except Exception as e:
        print('Error......{}'.format(e.args))
        return []
write_to_s3 method:-

import os
import cottoncandy as cc  # used below to upload the numpy array directly to S3

def write_to_s3(rdd, output_path, format_name):
    file_path = rdd[0]
    file_name_without_ext = get_file_name_without_ext(file_path)
    bucket_name = output_path.split('/', 1)[0]

    final_path = 'deepak' + '/' + file_name_without_ext + '.' + format_name

    LOGGER.info("Saving to S3....")
    cci = cc.get_interface(bucket_name, ACCESS_KEY=os.environ.get("AWS_ACCESS_KEY_ID"),
                           SECRET_KEY=os.environ.get("AWS_SECRET_ACCESS_KEY"), endpoint_url='https://s3.amazonaws.com')
    response = cci.upload_npy_array(final_path, rdd[1])
    return response
In the write_to_s3 method I take the RDD record, extract the key name to save under and the bucket, and then use a library called cottoncandy to save the RDD content (in my case a numpy array) directly, instead of writing any intermediate file.
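
For reference, the same "no intermediate file" idea can be sketched with plain boto3 and an in-memory buffer (a minimal sketch, not the code used above; upload_npy and its arguments are placeholders):

import io
import boto3
import numpy as np

def upload_npy(bucket, key, arr):
    # Serialize the numpy array into an in-memory buffer instead of a temp file
    buf = io.BytesIO()
    np.save(buf, arr)
    buf.seek(0)
    # Create the client inside the function so that no unpicklable object
    # gets captured by the Spark closure
    s3 = boto3.client('s3')
    s3.upload_fileobj(buf, bucket, key)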

I get the following error:-

127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 600, in save_reduce
    save(state)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib64/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/usr/lib64/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
TypeError: can't pickle thread.lock objects
Traceback (most recent call last):
  File "one_file5.py", line 98, in <module>
    run()
  File "one_file5.py", line 89, in run
    LOGGER.info('features_rdd rdd created,...... %s',features_rdd.count())    
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 1041, in count
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 1032, in sum
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 906, in fold
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 809, in collect
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2388, in _wrap_function
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2374, in _prepare_for_python_RDD
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/serializers.py", line 464, in dumps
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 704, in dumps
  File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 162, in dump
pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects.
When I comment out the features_rdd part of the code, the program runs fine, which means something in the features_rdd part is wrong. I am not sure what I am doing wrong here.

I am running the program on AWS EMR with 4 executors, 7 executor cores and 8 GB of executor RAM.
Spark version 2.2.1
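
For context, a spark-submit invocation matching that setup might look like the sketch below (the flags simply mirror the numbers above; adjust as needed):

spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 7 \
  --executor-memory 8G \
  one_file5.py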

Replace your current code with mapPartitions. Loading the model inside the partition function means it is created on each executor instead of being pickled from the driver, which is what triggers the thread.lock error:

def extract_features_(xs):
    model_data = initVGG16()
    for k, v in xs:
        yield k, extract_features(model_data, v)

features_rdd = s3_files_rdd.mapPartitions(extract_features_)

The above works fine, but once I have the features and want to write them to S3 by passing them to a write_to_s3 function in another map, it comes back with the same error again. I have updated my question.
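
One option worth trying (a sketch, not a verified fix) is to do the extraction and the S3 write inside a single mapPartitions pass, so that neither the Keras model nor the S3 interface is ever referenced from a separate closure that Spark has to pickle; extract_and_write_ below is a hypothetical helper built from the functions already shown:

def extract_and_write_(xs):
    # Everything that cannot be pickled (the Keras model, the S3 interface)
    # is created here on the executor, never on the driver
    model_data = initVGG16()
    for k, v in xs:
        features = extract_features(model_data, v)
        yield k, write_to_s3((k, features), OUTPUT, FORMAT_NAME)

response_rdd = s3_files_rdd.mapPartitions(extract_and_write_)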