Python: how to load a pickled ML model from GCS into Dataflow/Apache Beam
I developed an Apache Beam pipeline locally, where I run predictions against a sample file. On my local machine I can load the model as follows:
with open('gs://newbucket322/my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)
But this obviously doesn't work when running on Google Dataflow. I tried changing the path to gs://, but that obviously didn't work either.
I also tried the following snippet to load the file:
class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        yield (element, gcs.open(element).read())

model = (p
         | "Initialize" >> beam.Create(["gs://bucket/file.pkl"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )
This doesn't work either when I want to load my model, or at least I can't call the predict method on this model variable.
This should be a really simple task, but I can't seem to find a straightforward answer.

You can define a ParDo as follows:
import logging
import pickle

import apache_beam as beam
from google.cloud import storage


class PredictSklearn(beam.DoFn):
    """Download a pickled model from GCS once per worker and run predictions."""
    def __init__(self, project=None, bucket_name=None, model_path=None, destination_name=None):
        self._model = None
        self._project = project
        self._bucket_name = bucket_name
        self._model_path = model_path
        self._destination_name = destination_name

    @staticmethod
    def download_blob(bucket_name=None, source_blob_name=None, project=None, destination_file_name=None):
        """Downloads a blob from the bucket to a local file."""
        storage_client = storage.Client(project=project)
        bucket = storage_client.get_bucket(bucket_name)
        blob = bucket.blob(source_blob_name)
        blob.download_to_filename(destination_file_name)

    # setup() runs once per DoFn instance, so the model is loaded
    # once (or very few times) rather than once per element
    def setup(self):
        logging.info("Model Initialization {}".format(self._model_path))
        self.download_blob(bucket_name=self._bucket_name, source_blob_name=self._model_path,
                           project=self._project, destination_file_name=self._destination_name)
        # unpickle the model
        with open(self._destination_name, 'rb') as fid:
            self._model = pickle.load(fid)

    def process(self, element):
        element["prediction"] = self._model.predict(element["data"])
        return [element]
Can you update your question with the function definition of ReadGcsBlobs?

Sorry, the definition has been added now.

Hi, I can't get this code to work. First, without self, download_blob cannot be called; second, the download_blob function doesn't take that many arguments. I also don't quite follow the code: where does destination_name get defined?

I've updated the answer. I haven't executed the exact code to check the syntax, but this should give you an idea of how to run a model that was pickled and stored in GCS. Please accept the answer if you find it useful.

@JayadeepJayaraman I have code very similar to this, but for some reason it seems to run the workers out of disk space. I gave the VMs 1000 GB and the model is about 50 MB. Any idea why?
model = (p
         | "Read Files" >> TextIO...
         | "Run Predictions" >> beam.ParDo(PredictSklearn(project=known_args.bucket_project_id,
                                                          bucket_name=known_args.bucket_name,
                                                          model_path=known_args.model_path,
                                                          destination_name=known_args.destination_name))
        )
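The setup()/process() split in the answer (download and unpickle once per worker, reuse for every element) can be exercised without Beam or GCS. Below is a minimal stand-alone sketch of that lifecycle, where the hypothetical DummyModel and a local pickle file stand in for the real estimator and the bucket download:

```python
import pickle
import tempfile


class DummyModel:
    """Stand-in for a fitted sklearn estimator (illustrative only)."""
    def predict(self, data):
        return [x * 2 for x in data]


class PredictDoFn:
    """Mirrors the DoFn lifecycle: setup() loads the model once, process() reuses it."""
    def __init__(self, model_file):
        self._model_file = model_file
        self._model = None

    def setup(self):
        # In the real pipeline this is where download_blob() would run first.
        with open(self._model_file, 'rb') as fid:
            self._model = pickle.load(fid)

    def process(self, element):
        element["prediction"] = self._model.predict(element["data"])
        return [element]


# Round-trip: pickle a model to disk, then run the DoFn-style lifecycle.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as f:
    pickle.dump(DummyModel(), f)

fn = PredictDoFn(f.name)
fn.setup()
print(fn.process({"data": [1, 2, 3]}))  # [{'data': [1, 2, 3], 'prediction': [2, 4, 6]}]
```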