Python TensorFlow pad sequence feature column


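How do I pad sequences in a feature column, and what is the dimension in a feature column?

I am using tensorflow2.0 and implementing a text summarization example. I am completely new to machine learning, deep learning, and TensorFlow.

I came across feature_column and found them useful, as I think they can be embedded in the model's processing pipeline.

In the classic scenario where I don't use feature_column, I can preprocess the text, tokenize it, convert it into a sequence of numbers, and then pad the sequences to a maxlen of 100 words. I am not able to get this done when using feature_column.

Below is what I have written so far.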


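    import tensorflow as tf
    import tensorflow_text as tf_text

    train_dataset = tf.data.experimental.make_csv_dataset(
        'assets/train_dataset.csv', label_name=label, num_epochs=1, shuffle=True,
        shuffle_buffer_size=10000, batch_size=1, ignore_errors=True)

    vocabulary = ds.get_vocabulary()

    def text_demo(feature_column):
        feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
        article, _ = next(iter(train_dataset.take(1)))
        tokenizer = tf_text.WhitespaceTokenizer()
        tokenized = tokenizer.tokenize(article['Text'])
        sequence_input, sequence_length = feature_layer({'Text': tokenized.to_tensor()})
        print(sequence_input)

    def categorical_column(feature_column):
        dense_column = tf.keras.layers.DenseFeatures(feature_column)
        article, _ = next(iter(train_dataset.take(1)))
        lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
            filters='')
        lang_tokenizer.fit_on_texts(article)
        tensor = lang_tokenizer.texts_to_sequences(article)
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                               padding='post', maxlen=50)
        print(dense_column(tensor).numpy())

    text_seq_vocab_list = tf.feature_column.sequence_categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
    text_embedding = tf.feature_column.embedding_column(text_seq_vocab_list, dimension=8)
    text_demo(text_embedding)

    numerical_vocab_list = tf.feature_column.categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
    embedding = tf.feature_column.embedding_column(numerical_vocab_list, dimension=8)
    categorical_column(embedding)
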
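I also have no idea what to use here, sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list. SequenceFeatures is not explained in the documentation either, though I know it is an experimental feature.

I am also not able to understand what the dimension param does.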

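Actually, this

"I also have no idea what to use here, sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list"

should be the first question, because it affects the interpretation of the topic name.

Also, it is not clear what you mean by text summarization. What kind of model or layers are you going to pass the processed text into?

By the way, this is important, because tf.keras.layers.DenseFeatures and tf.keras.experimental.SequenceFeatures support different network architectures and approaches.

As the documentation says, the output of the SequenceFeatures layer is supposed to be fed into sequence networks, such as an RNN.

DenseFeatures produces a dense tensor as output and is therefore suitable for other types of networks.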
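To make the difference between the two outputs concrete, here is a minimal pure-Python sketch that does not use TensorFlow at all. The vocabulary, the random embedding table, and the helper names sequence_style and dense_style are hypothetical; averaging the token vectors is used only as an assumed way of collapsing a sequence into one vector per example.

```python
import random

random.seed(0)
DIM = 3    # embedding dimension (hypothetical value)
PAD = ''   # padding token, as in a padded list of strings

vocab = ['a', 'wiki', 'is', 'run', 'using']   # hypothetical vocabulary
table = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

batch = [['a', 'wiki', 'is', PAD],
         ['run', PAD, PAD, PAD]]

def sequence_style(batch):
    """One vector per position, plus the true length of each example."""
    vectors = [[table.get(tok, [0.0] * DIM) for tok in row] for row in batch]
    lengths = [sum(tok != PAD for tok in row) for row in batch]
    return vectors, lengths

def dense_style(batch):
    """One vector per example: the mean of the real (non-padding) token vectors."""
    vectors, lengths = sequence_style(batch)
    return [[sum(col) / n for col in zip(*row)] for row, n in zip(vectors, lengths)]

seq, lengths = sequence_style(batch)
dense = dense_style(batch)
print(len(seq), len(seq[0]), len(seq[0][0]))   # 2 4 3 -> (batch, max_len, DIM)
print(lengths)                                 # [3, 1]
print(len(dense), len(dense[0]))               # 2 3   -> (batch, DIM)
```

The sequence-style result keeps the position axis, which is what order-aware layers need; the dense-style result has already discarded it.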

Since you perform tokenization in your code snippet, you are going to use embeddings in your model. You then have two options:

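  • Pass the learned embeddings forward into Dense layers. This means that word order will not be analyzed.
  • Pass the learned embeddings into Convolution, Recurrent, AveragePooling, or LSTM layers, so that word order is used for learning as well.

The first option requires using:

  • tf.keras.layers.DenseFeatures
  • one of tf.feature_column.categorical_column_*()
  • tf.feature_column.embedding_column()

The second option requires using:

  • tf.keras.experimental.SequenceFeatures
  • one of tf.feature_column.sequence_categorical_column_*()
  • tf.feature_column.embedding_column()

Here are some examples. The preprocessing and training parts are the same for both options: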

    import tensorflow as tf
    print(tf.__version__)
    
    from tensorflow import feature_column
    
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.preprocessing.text import text_to_word_sequence
    import tensorflow.keras.utils as ku
    from tensorflow.keras.utils import plot_model
    
    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    # raw string, so that the backslashes are not treated as escape sequences
    DATA_PATH = r'C:\SoloLearnMachineLearning\Stackoverflow\TextDataset.csv'
    
    #it is just two column csv, like:
    # text;label
    # A wiki is run using wiki software;0
    # otherwise known as a wiki engine.;1
    
    dataframe = pd.read_csv(DATA_PATH, delimiter = ';')
    dataframe.head()
    
    # Preprocessing before feature_column includes
    # - getting the vocabulary
    # - tokenization, which here means only splitting into tokens.
    #   Encoding sentences with the vocabulary will be done by feature_column!
    # - padding
    # - truncating
    
    # Build vocabulary
    vocab_size = 100
    oov_tok = '<OOV>'
    
    sentences = dataframe['text'].to_list()
    
    tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
    
    tokenizer.fit_on_texts(sentences)
    word_index = tokenizer.word_index
    
    # if word_index is shorter than the default value of vocab_size, we save the actual size
    vocab_size=len(word_index)
    print("vocab_size = word_index = ",len(word_index))
    
    # Split sentences into tokens. Here token = word
    # text_to_word_sequence() has a good default filter for
    # characters, which includes basic punctuation, tabs, and newlines
    dataframe['text'] = dataframe['text'].apply(text_to_word_sequence)
    
    dataframe.head()
    
    max_length = 6
    
    # padding and truncating sentences
    # do that directly with strings without using tokenizer.texts_to_sequences()
    # the feature_column will convert strings into numbers
    # (pad with '' up to N tokens, then cut to exactly N tokens)
    dataframe['text']=dataframe['text'].apply(lambda x, N=max_length: (x + N * [''])[:N])
    dataframe.head()
    
    # Define method to create tf.data dataset from Pandas Dataframe
    def df_to_dataset(dataframe, label_column, shuffle=True, batch_size=32):
        dataframe = dataframe.copy()
        #labels = dataframe.pop(label_column)
        labels = dataframe[label_column]
    
        ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
        if shuffle:
            ds = ds.shuffle(buffer_size=len(dataframe))
        ds = ds.batch(batch_size)
        return ds
    
    # Split dataframe into train and validation sets
    train_df, val_df = train_test_split(dataframe, test_size=0.2)
    
    print(len(train_df), 'train examples')
    print(len(val_df), 'validation examples')
    
    batch_size = 32
    ds = df_to_dataset(dataframe, 'label',shuffle=False,batch_size=batch_size)
    
    train_ds = df_to_dataset(train_df, 'label',  shuffle=False, batch_size=batch_size)
    val_ds = df_to_dataset(val_df, 'label', shuffle=False, batch_size=batch_size)
    
    # and small batch for demo
    example_batch = next(iter(ds))[0]
    example_batch
    
    # Helper methods to print example outputs for a defined feature_column
    
    def demo(feature_column):
        feature_layer = tf.keras.layers.DenseFeatures(feature_column)
        print(feature_layer(example_batch).numpy())
    
    def seqdemo(feature_column):
        sequence_feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
        print(sequence_feature_layer(example_batch))
    
The second option, where we care about word order when training our model:

    # Define categorical column for our text feature,
    # which is preprocessed into lists of tokens
    # Note that key name should be the same as original column name in dataframe
    text_column = feature_column.sequence_categorical_column_with_vocabulary_list(
        key='text', vocabulary_list=list(word_index))
    
    # the dimension argument here is exactly the dimension of the space in
    # which tokens will be represented during the model's learning
    # see the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings
    text_embedding = feature_column.embedding_column(text_column, dimension=8)
    seqdemo(text_embedding)  # seqdemo() prints the output itself and returns None
    
    # Then define the layers and the model itself
    # This example uses Keras Functional API instead of Sequential
    # just for more generality
    
    # Define SequenceFeatures layer to pass feature_columns into Keras model
    sequence_feature_layer = tf.keras.experimental.SequenceFeatures(text_embedding)
    
    # Define inputs for each feature column. See
    # https://github.com/tensorflow/tensorflow/issues/27416#issuecomment-502218673
    feature_layer_inputs = {}
    sequence_feature_layer_inputs = {}
    
    # Here we have just one column
    
    sequence_feature_layer_inputs['text'] = tf.keras.Input(shape=(max_length,),
                                                           name='text',
                                                           dtype=tf.string)
    print(sequence_feature_layer_inputs)
    
    # Define outputs of SequenceFeatures layer
    # and actually use them as the first layer of the model
    
    # Note here that the SequenceFeatures layer produces a tuple of two tensors as output.
    # We need just the first one to pass on.
    sequence_feature_layer_outputs, _ = sequence_feature_layer(sequence_feature_layer_inputs)
    print(sequence_feature_layer_outputs)
    # Add subsequent layers. See https://keras.io/getting-started/functional-api-guide/
    
    # Conv1D and MaxPooling1D will learn features from word order
    x = tf.keras.layers.Conv1D(8,4)(sequence_feature_layer_outputs)
    x = tf.keras.layers.MaxPooling1D(2)(x)
    # Fully connected layers on top of the convolutional features
    x = tf.keras.layers.Dense(256, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    
    # This example supposes binary classification, as labels are 0 or 1
    x = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    
    model = tf.keras.models.Model(inputs=[v for v in sequence_feature_layer_inputs.values()],
                                  outputs=x)
    model.summary()
    
    # This example supposes binary classification, as labels are 0 or 1
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy']
                  #run_eagerly=True
                 )
    
    # Note that the fit() method looks up features in train_ds and val_ds by the name given in
    # tf.keras.Input(shape=(max_length,), name='text')
    
    # This model of course will learn nothing because of the fake data.
    
    num_epochs = 5
    history = model.fit(train_ds,
                        validation_data=val_ds,
                        epochs=num_epochs,
                        verbose=1
                        )
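As a rough check of how the sequence axis shrinks through the Conv1D and MaxPooling1D layers above, the output lengths can be computed by hand. This is a sketch that assumes the Keras defaults for both layers ('valid' padding, Conv1D stride 1, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def pool1d_output_length(length, pool_size, stride=None):
    """Output length of 1D pooling with 'valid' padding; stride defaults to pool_size."""
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1

max_length = 6                                                  # as in the example above
after_conv = conv1d_output_length(max_length, kernel_size=4)    # Conv1D(8, 4)
after_pool = pool1d_output_length(after_conv, pool_size=2)      # MaxPooling1D(2)
print(after_conv, after_pool)  # 3 1
```

With max_length = 6, only one time step is left after pooling. Because Dense layers act on the last axis, the final output would keep that axis, giving a shape of (batch, 1, 1). Adding a Flatten or global pooling layer before the Dense layers is one common way to drop it; that is a suggestion, not part of the original example.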
    
Please find the complete Jupyter notebooks with these examples on my GitHub.

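The dimension argument of feature_column.embedding_column() is exactly the dimension of the space in which tokens are represented while the model is learning. For a detailed explanation, see the word embeddings tutorial linked in the code comments above.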
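One way to picture dimension is as the row width of a lookup table with one row per vocabulary entry. The sketch below is plain Python rather than TensorFlow; the vocabulary, the random initial values, and the embed helper are hypothetical and only illustrate the shapes involved.

```python
import random

random.seed(42)

vocabulary = ['a', 'wiki', 'is', 'run']   # hypothetical 4-word vocabulary
dimension = 8                             # same value as in the example above

# One row of `dimension` floats per vocabulary entry; in a real model
# these values would be trained, here they are just random numbers.
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(dimension)]
                   for _ in vocabulary]

def embed(tokens):
    """Replace each known token with its row from the embedding table."""
    index = {word: i for i, word in enumerate(vocabulary)}
    return [embedding_table[index[t]] for t in tokens]

vectors = embed(['a', 'wiki', 'is'])
print(len(embedding_table), len(embedding_table[0]))  # 4 8
print(len(vectors), len(vectors[0]))                  # 3 8
```

A larger dimension gives each token more numbers to encode its meaning, at the cost of more parameters (vocabulary size times dimension).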


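Also note that using feature_column.embedding_column() is an alternative to tf.keras.layers.Embedding(). As you can see, feature_column performs the encoding step of the preprocessing pipeline, but you still have to split, pad, and truncate the sentences manually.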
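The manual steps can be sketched in a few lines of plain Python. The helper name split_pad_truncate is hypothetical, and a simple lower-casing whitespace split stands in for text_to_word_sequence(), which additionally filters punctuation.

```python
def split_pad_truncate(sentence, max_length, pad_token=''):
    """Split on whitespace, pad with pad_token, and cut to exactly max_length tokens."""
    tokens = sentence.lower().split()
    return (tokens + [pad_token] * max_length)[:max_length]

print(split_pad_truncate('A wiki is run using wiki software', 6))
# ['a', 'wiki', 'is', 'run', 'using', 'wiki']
print(split_pad_truncate('otherwise known', 6))
# ['otherwise', 'known', '', '', '', '']
```

Every row then has the same length, so the string lists can be stacked into a fixed-shape batch and handed to the feature column for encoding.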
