Python: Google Cloud ML Engine + TensorFlow: perform preprocessing/tokenization in input_fn()

python tensorflow google-cloud-platform

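I want to perform basic preprocessing and tokenization inside my input function. My data lives in CSV files in a Google Cloud Storage bucket (gs://) that I cannot modify. In addition, any modification of the input text has to happen inside my ML Engine package so that the same behavior can be reproduced at serving time.

My input function follows this basic structure: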

filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TextLineReader()
_, rows = reader.read_up_to(filename_queue, num_records=batch_size)
text, label = tf.decode_csv(rows, record_defaults = [[""],[""]])

# add logic to filter special characters
# add logic to make all words lowercase
words = tf.string_split(text) # splits based on white space

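Is there any option that avoids preprocessing the entire dataset ahead of time? Another post suggests that tf.py_func() can be used for these transformations, but its author notes that "the downside is that, as it is not saved in the graph, I cannot restore my saved model", so I am not convinced it would be useful at serving time. If I define my own tf.py_func() for preprocessing, and it is defined in the trainer package that I upload to the cloud, will I run into any problems? Are there alternatives I have not considered?

The best practice is to write one function that you can call from both the training/evaluation input function and the serving input function.

For example: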

def add_engineered(features):
  text = features['text']
  features['words'] = tf.string_split(text)
  return features

Then, in your input_fn, wrap the returned features with a call to add_engineered:

def input_fn():
  features = ...
  label = ...
  return add_engineered(features), label

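In your serving input function, make sure to wrap the returned features (not the feature placeholders) with the same call to add_engineered: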

def serving_input_fn():
    feature_placeholders = ...
    features = ...
    return tflearn.utils.input_fn_utils.InputFnOps(
      add_engineered(features),
      None,
      feature_placeholders
    )
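
The point of routing both paths through one function can be illustrated without TensorFlow. The following is only a plain-Python stand-in, not the answer's actual code: train_features and serving_features are hypothetical names, and lists of strings take the place of tensors. It shows that the derived "words" feature comes out identical on both paths because there is only one place where it is computed.

```python
def add_engineered(features):
    """Stand-in for the shared transform: derive 'words' from raw 'text'."""
    out = dict(features)
    out["words"] = [t.split() for t in features["text"]]
    return out


def train_features(rows):
    """Hypothetical training path: rows are (text, label) pairs parsed from CSV."""
    features = {"text": [text for text, _ in rows]}
    labels = [label for _, label in rows]
    return add_engineered(features), labels


def serving_features(request_instances):
    """Hypothetical serving path: each instance carries only the raw 'text'."""
    features = {"text": [inst["text"] for inst in request_instances]}
    return add_engineered(features)


train, labels = train_features([("a b c", "x")])
serve = serving_features([{"text": "a b c"}])
print(train["words"] == serve["words"])
```

If the transform were written twice, once per path, the two copies could drift apart and the model would see different features at prediction time than it was trained on.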

Your model will consume "words". However, the JSON input you send at prediction time only needs to contain "text", i.e. the raw value.
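
As a rough sketch of what such a request could look like, the snippet below builds a JSON body with the standard library. It assumes the online prediction request format with a top-level "instances" list, and it assumes the serving placeholder is keyed "text"; build_prediction_request is a hypothetical helper, not part of any Google client library.

```python
import json


def build_prediction_request(texts):
    """Wrap raw text strings in an {"instances": [...]} request body.

    Only the raw 'text' is sent; 'words' is derived inside the serving graph.
    """
    return json.dumps({"instances": [{"text": t} for t in texts]})


body = build_prediction_request(["This is a sentence to tokenize!"])
print(body)
```

The client never tokenizes anything itself, which is what keeps training and serving consistent.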

The original answer linked to a complete working example.

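Hi Lak, thanks for the detailed reply! I understand what goes into input_fn, but I specifically want to know whether there is a good way to do more than apply tf.string_split(). Before tf.string_split I want all characters lowercased, and I also want to strip special characters from the raw text (such as * or ! that may be attached to the end of a word), so that "This is a sentence to tokenize!" becomes "this is a sentence to tokenize" before string_split() runs. Is py_func() the only option? Will it cause problems at serving time?

In addition to Lak's answer, I'd like to address the part about tf.py_func: it is not serialized and deserialized with the graph, so it cannot be used for serving.

In the add_engineered method you are not limited to TensorFlow functions. You can call any Python function, with the caveat that functions that are not TensorFlow functions may involve passing data between C++ and Python, which causes some inefficiency. Core Python functions are straightforward, but functions that depend on external modules require configuration changes when you deploy the application.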
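
The normalization described in the comment above (lowercase, strip special characters, then split) can be pinned down as a plain-Python reference. This is a sketch of the intended behavior only, not an in-graph implementation: the regular expression is an assumption about which characters count as "special", and normalize and tokenize are hypothetical names.

```python
import re

# Assumption: anything other than lowercase letters, digits and whitespace
# counts as a "special character" and is dropped.
_SPECIAL_CHARS = re.compile(r"[^a-z0-9\s]")


def normalize(text):
    """Lowercase the text, then remove special characters such as * or !."""
    return _SPECIAL_CHARS.sub("", text.lower())


def tokenize(text):
    """Normalize, then split on whitespace."""
    return normalize(text).split()


print(tokenize("This is a sentence to Tokenize!"))
```

A reference like this is mainly useful as a test oracle. For serving, the same steps would need to be expressed as ops that are saved with the graph. Newer TensorFlow releases provide string ops such as tf.strings.lower and tf.strings.regex_replace that may fit; whether they are available depends on the TensorFlow version of your ML Engine runtime, so verify that before relying on them.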