Feeding a free-text feature into a TensorFlow canned estimator via feature columns, using the Dataset API


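I'm trying to build a model that gives reddit_score = f('subreddit', 'comment').

This is mainly an example that I can then build on for a work project.

My code is here.

My problem is that canned estimators must have feature_columns that are instances of the FeatureColumn class.

I have my vocab file, and I know that if I limited myself to the first word of a comment I could do something like: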

tf.feature_column.categorical_column_with_vocabulary_file(
        key='comment',
        vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
        )

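However, if I pass in, say, the first 10 words of a comment, I don't know how to go from a string such as "this is a pre padded 10 word comment xyzpadxyz xyzpadxyz" to a feature_column, so that I can build an embedding to pass into the deep features of a wide and deep model.
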
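It seems like it must be something really obvious or simple, but I can't for the life of me find an existing example with this particular setup (canned wide and deep, the Dataset API, and a mix of features, e.g. subreddit and a raw text feature like comment).

I was even thinking of doing the vocab integer lookup myself, so that the comment feature I pass in is something like [23,45,67,12,1345,7,9999999], and then maybe I could get it in as a numeric feature with a shape and do something with it from there. But this feels a bit odd.

You can use tf.string_split() and then tf.slice() to slice the result, taking care to tf.pad() the strings with zeros first. See the title preprocessing operations in:

Once you have the words, you can create ten feature columns.

Adding an answer based on the approach from the post @Lak did, but tweaked a little for the Dataset API.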
import tensorflow as tf  # written against the TF 1.x API

# Create an input function that reads files using the Dataset API,
# then provides the results to the Estimator API
def read_dataset(prefix, mode, batch_size):

    def _input_fn():

        def decode_csv(value_column):

            columns = tf.decode_csv(value_column, field_delim='|', record_defaults=DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))

            # Split the comment into words; string_split returns a SparseTensor of shape [1, n_words]
            words = tf.string_split([features['comment']])
            words = tf.sparse_tensor_to_dense(words, default_value=PADWORD)

            # Pad on the right, then slice, so every example ends up with
            # exactly MAX_DOCUMENT_LENGTH entries
            padding = tf.constant([[0, 0], [0, MAX_DOCUMENT_LENGTH]])
            padded = tf.pad(words, padding)
            features['comment_words'] = tf.slice(padded, [0, 0], [-1, MAX_DOCUMENT_LENGTH])

            label = features.pop(LABEL_COLUMN)

            return features, label

        # Use prefix to create file path
        file_path = '{}/{}*{}*'.format(INPUT_DIR, prefix, PATTERN)

        # Create list of files that match pattern
        file_list = tf.gfile.Glob(file_path)

        # Create dataset from file list
        dataset = (tf.data.TextLineDataset(file_list)  # Read text file
                    .map(decode_csv))  # Transform each elem by applying decode_csv fn

        tf.logging.info("...dataset.output_types={}".format(dataset.output_types))
        tf.logging.info("...dataset.output_shapes={}".format(dataset.output_shapes))

        if mode == tf.estimator.ModeKeys.TRAIN:

            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)

        else:

            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)

        return dataset.make_one_shot_iterator().get_next()

    return _input_fn
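
To make the pad-then-slice step in decode_csv() concrete, here is a minimal plain-Python sketch of the same idea applied to a single comment. It is not TensorFlow code: the helper name pad_and_slice and the value MAX_DOCUMENT_LENGTH = 10 are assumptions made for illustration, and it fills with the pad word from the question, whereas the fill value tf.pad() uses for string tensors may differ.

```python
PADWORD = "xyzpadxyz"      # pad token taken from the example string in the question
MAX_DOCUMENT_LENGTH = 10   # assumed value; the original constant is not shown


def pad_and_slice(comment, max_len=MAX_DOCUMENT_LENGTH, padword=PADWORD):
    """Return exactly max_len words: short comments are padded, long ones cut."""
    words = comment.split()
    # Pad generously on the right first, so the slice below always has enough entries
    padded = words + [padword] * max_len
    # Then keep only the first max_len entries
    return padded[:max_len]


short = pad_and_slice("this is a short comment")
long = pad_and_slice(" ".join("w{}".format(i) for i in range(25)))
print(short)
print(len(short), len(long))
```

Because every comment comes out with the same length, the examples can be batched without shape mismatches, which is what the fixed-size slice in decode_csv() is there to guarantee.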

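Then, in the function below, we can reference the field we created as part of decode_csv():
# Define feature columns
def get_wide_deep():

    EMBEDDING_SIZE = 10

    # Define column types
    subreddit = tf.feature_column.categorical_column_with_vocabulary_list('subreddit', ['news', 'ireland', 'pics'])

    # Embed the fixed-length word list produced in decode_csv()
    comment_embeds = tf.feature_column.embedding_column(
        categorical_column = tf.feature_column.categorical_column_with_vocabulary_file(
            key='comment_words',
            vocabulary_file='{}/vocab.csv-00000-of-00001'.format(INPUT_DIR),
            vocabulary_size=100
            ),
        dimension = EMBEDDING_SIZE
        )

    # Sparse columns are wide, have a linear relationship with the output
    wide = [ subreddit ]

    # Continuous columns are deep, have a complex relationship with the output
    deep = [ comment_embeds ]

    return wide, deep

I was thinking maybe I could write a function that splits the comment feature into a comment-words feature as part of the read_dataset() function.

Hmm, I'm getting "Cannot batch tensors with different shapes in component 1. First element had shape [1,11] and element 1 had shape [1,6]." I'm guessing this is due to the variable-length comments. Do you know whether I can handle this in decode_csv(), or does it need to be handled in the raw data itself?

As long as you have the same number of words in each feature (which is fine for my use case), the approach above works. If not, I think you would need to use dataset.padded_batch, as discussed in a related question.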
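
The dataset.padded_batch suggestion from the comments can be pictured with a small plain-Python sketch: instead of forcing every comment to one global length up front, each batch is padded only to the length of its longest member. This illustrates the idea and is not the TensorFlow implementation; the function name padded_batches and the pad token are assumptions made for the example.

```python
def padded_batches(sequences, batch_size, pad_value):
    """Yield batches in which every sequence is padded to the batch's longest length."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        # Right-pad each sequence so all rows in this batch share one shape
        yield [seq + [pad_value] * (longest - len(seq)) for seq in batch]


comments = [
    ["great", "post"],
    ["i", "do", "not", "agree", "at", "all"],
    ["lol"],
]
for batch in padded_batches(comments, batch_size=2, pad_value="xyzpadxyz"):
    print(batch)
```

With this scheme two comments of 11 and 6 words can share a batch, because the shorter one is padded to 11 at batching time rather than triggering the "different shapes" error.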