Python TensorFlow: padding sequences in a feature column

python tensorflow machine-learning deep-learning tensorflow2.0
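How do I pad sequences in a feature column, and what is the dimension in a feature_column?

I am using TensorFlow 2.0 and implementing a text summarization example. I am pretty new to machine learning, deep learning and TensorFlow.

I came across feature_column and found feature columns useful, because I think they can be embedded in the processing pipeline of the model.

In the classic scenario without feature_column, I can preprocess the text, tokenize it, convert it into a sequence of numbers and then pad the sequences to a maxlen of, say, 100 words. I cannot get this done when using feature_column.

Below is what I have written so far.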

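import tensorflow as tf
import tensorflow_text as tf_text

train_dataset = tf.data.experimental.make_csv_dataset(
    'assets/train_dataset.csv', label_name=label, num_epochs=1, shuffle=True,
    shuffle_buffer_size=10000, batch_size=1, ignore_errors=True)

vocabulary = ds.get_vocabulary()

def text_demo(feature_column):
    feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    article, _ = next(iter(train_dataset.take(1)))
    tokenizer = tf_text.WhitespaceTokenizer()
    tokenized = tokenizer.tokenize(article['Text'])
    sequence_input, sequence_length = feature_layer({'Text': tokenized.to_tensor()})
    print(sequence_input)

def categorical_column(feature_column):
    dense_column = tf.keras.layers.DenseFeatures(feature_column)
    article, _ = next(iter(train_dataset.take(1)))
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
        filters='')
    lang_tokenizer.fit_on_texts(article)
    tensor = lang_tokenizer.texts_to_sequences(article)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                           padding='post', maxlen=50)
    print(dense_column(tensor).numpy())

text_seq_vocab_list = tf.feature_column.sequence_categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
text_embedding = tf.feature_column.embedding_column(text_seq_vocab_list, dimension=8)
text_demo(text_embedding)

numerical_vocab_list = tf.feature_column.categorical_column_with_vocabulary_list(key='Text', vocabulary_list=list(vocabulary))
embedding = tf.feature_column.embedding_column(numerical_vocab_list, dimension=8)
categorical_column(embedding)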

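I am also not sure what to use here, sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list. SequenceFeatures is not explained in the documentation either, although I know it is an experimental feature.

I also cannot understand what the dimension param does.
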
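Actually, this

I am also not sure what to use here, sequence_categorical_column_with_vocabulary_list or categorical_column_with_vocabulary_list

should be the first question, because it affects how the title of the topic is interpreted.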
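Also, it is not quite clear what you mean by text summarization. What type of model or layers are you going to pass the processed text into?

This matters, by the way, because tf.keras.layers.DenseFeatures and tf.keras.experimental.SequenceFeatures are meant for different network architectures and approaches.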

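As the documentation says, the output of the SequenceFeatures layer is supposed to be fed into sequence networks, such as an RNN.
DenseFeatures produces a dense tensor as output and is therefore suitable for other types of networks.
Since you perform tokenization in your code snippet, you are going to use embeddings in your model.
Then you have two options:
Pass the learned embeddings forward into Dense layers. This means that you will not analyze word order.
Pass the learned embeddings into Convolution, Recurrent, AveragePooling or LSTM layers, and so use word order for learning as well.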
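
To make the difference between the two options concrete, here is a minimal pure-Python sketch (no TensorFlow involved). The tiny embedding table and the helper names `embed` and `mean_pool` are made up for illustration. Mean pooling stands in for what happens when the embeddings of a sentence are averaged before a Dense layer, which, as far as I know, is what `embedding_column` does with its default `combiner='mean'`.

```python
# Hypothetical 2-dimensional embeddings for a three-word vocabulary.
EMBEDDINGS = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}

def embed(tokens):
    """Return one embedding vector per token, keeping token order."""
    return [EMBEDDINGS[t] for t in tokens]

def mean_pool(vectors):
    """Average the vectors component-wise, collapsing the sequence axis."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])

# Option 1: the averaged embeddings are identical, so word order is lost.
print(mean_pool(a) == mean_pool(b))
# Option 2: the per-token sequences differ, so word order is still available.
print(a == b)
```

The two sentences contain the same words in a different order: after averaging they cannot be told apart, while the sequence representation still distinguishes them.
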
The first option requires using:

tf.keras.layers.DenseFeatures
one of tf.feature_column.categorical_column_*()
and tf.feature_column.embedding_column()

The second option requires using:

tf.keras.experimental.SequenceFeatures
one of tf.feature_column.sequence_categorical_column_*()
and tf.feature_column.embedding_column()

Here are some examples.
The preprocessing and training parts are the same for both options:
import tensorflow as tf
print(tf.__version__)

from tensorflow import feature_column

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence
import tensorflow.keras.utils as ku
from tensorflow.keras.utils import plot_model

import pandas as pd
from sklearn.model_selection import train_test_split

# Raw string, so that the backslashes in the Windows path are not treated as escapes
DATA_PATH = r'C:\SoloLearnMachineLearning\Stackoverflow\TextDataset.csv'

# The file is just a two-column CSV, like:
# text;label
# A wiki is run using wiki software;0
# otherwise known as a wiki engine.;1

dataframe = pd.read_csv(DATA_PATH, delimiter = ';')
dataframe.head()

# Preprocessing before feature_column includes
# - getting the vocabulary
# - tokenization, which here means only splitting into tokens.
#   Encoding sentences with the vocabulary will be done by feature_column!
# - padding
# - truncating

# Build the vocabulary
vocab_size = 100
oov_tok = '<OOV>'

sentences = dataframe['text'].to_list()

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Keep the actual vocabulary size, which may differ from the initial vocab_size
vocab_size = len(word_index)
print("vocab_size = len(word_index) =", vocab_size)

# Split sentences into tokens; here a token is a word.
# text_to_word_sequence() has a good default filter that removes
# characters including basic punctuation, tabs, and newlines
dataframe['text'] = dataframe['text'].apply(text_to_word_sequence)

dataframe.head()

max_length = 6

# Pad and truncate the sentences.
# Do that directly on the lists of strings, without tokenizer.texts_to_sequences();
# the feature_column will convert the strings into numbers.
# Appending N empty strings and slicing to N both pads and truncates in one step.
dataframe['text'] = dataframe['text'].apply(lambda x, N=max_length: (x + N * [''])[:N])
dataframe.head()

# Define method to create tf.data dataset from Pandas Dataframe
def df_to_dataset(dataframe, label_column, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    #labels = dataframe.pop(label_column)
    labels = dataframe[label_column]

    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

# Split dataframe into train and validation sets
train_df, val_df = train_test_split(dataframe, test_size=0.2)

print(len(train_df), 'train examples')
print(len(val_df), 'validation examples')

batch_size = 32
ds = df_to_dataset(dataframe, 'label',shuffle=False,batch_size=batch_size)

train_ds = df_to_dataset(train_df, 'label',  shuffle=False, batch_size=batch_size)
val_ds = df_to_dataset(val_df, 'label', shuffle=False, batch_size=batch_size)

# and small batch for demo
example_batch = next(iter(ds))[0]
example_batch

# Helper methods to print example outputs for a given feature_column

def demo(feature_column):
    feature_layer = tf.keras.layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch).numpy())

def seqdemo(feature_column):
    sequence_feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    print(sequence_feature_layer(example_batch))
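
The padding and truncating step in the listing above can be isolated into a small helper. This is only a sketch in plain Python: the name `pad_or_truncate` is hypothetical, and the empty string as padding token simply mirrors the lambda used above.

```python
def pad_or_truncate(tokens, length, pad_token=""):
    """Return exactly `length` tokens: cut long lists, right-pad short ones."""
    padded = list(tokens) + [pad_token] * length
    return padded[:length]

short = pad_or_truncate(["a", "wiki"], 6)
long = pad_or_truncate("otherwise known as a wiki engine today".split(), 6)
print(short)
print(long)
```

Every row then has the same length, which is what the fixed `shape=(max_length,)` of the Keras input further down relies on.
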

The second option, when we care about word order and let the model learn from it:
# Define a categorical column for our text feature,
# which is preprocessed into lists of tokens.
# Note that the key must be the same as the original column name in the dataframe.
text_column = feature_column.sequence_categorical_column_with_vocabulary_list(
    key='text', vocabulary_list=list(word_index))

# The dimension argument here is exactly the dimension of the space in
# which tokens will be represented while the model learns;
# see the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings
text_embedding = feature_column.embedding_column(text_column, dimension=8)
# seqdemo() prints its result itself and returns None
seqdemo(text_embedding)

# Then define the layers and the model itself.
# This example uses the Keras Functional API instead of Sequential,
# just for more generality.

# Define SequenceFeatures layer to pass feature_columns into Keras model
sequence_feature_layer = tf.keras.experimental.SequenceFeatures(text_embedding)

# Define inputs for each feature column. See
# https://github.com/tensorflow/tensorflow/issues/27416#issuecomment-502218673
sequence_feature_layer_inputs = {}

# Here we have just one column

sequence_feature_layer_inputs['text'] = tf.keras.Input(shape=(max_length,),
                                                       name='text',
                                                       dtype=tf.string)
print(sequence_feature_layer_inputs)

# Define the outputs of the SequenceFeatures layer
# and actually use them as the first layer of the model.

# Note that the SequenceFeatures layer produces a tuple of two tensors as output.
# We only need the first one to pass on.
sequence_feature_layer_outputs, _ = sequence_feature_layer(sequence_feature_layer_inputs)
print(sequence_feature_layer_outputs)
# Add the subsequent layers. See https://keras.io/getting-started/functional-api-guide/

# Conv1D and MaxPooling1D will learn features from word order
x = tf.keras.layers.Conv1D(8,4)(sequence_feature_layer_outputs)
x = tf.keras.layers.MaxPooling1D(2)(x)
# Flatten the remaining time steps so that the model outputs
# one prediction per example rather than one per time step
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(0.2)(x)

# This example supposes binary classification, as labels are 0 or 1
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.models.Model(inputs=[v for v in sequence_feature_layer_inputs.values()],
                              outputs=x)
model.summary()

# This example supposes binary classification, as labels are 0 or 1
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy']
              #run_eagerly=True
             )

# Note that fit() looks up the features in train_ds and val_ds by the name given in
# tf.keras.Input(shape=(max_length,), name='text', ...)

# This model will of course learn nothing, because the data is fake.

num_epochs = 5
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=num_epochs,
                    verbose=1
                    )
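
It can help to check by hand how the sequence length shrinks through the convolution and pooling layers above. The sketch below uses the usual output-length formulas for `padding='valid'`, with stride 1 for the convolution and a stride equal to the pool size for pooling, which I believe are the Keras defaults. The helper names are hypothetical.

```python
def conv1d_valid_length(steps, kernel_size, stride=1):
    """Number of output steps of a 1-D convolution with 'valid' padding."""
    return (steps - kernel_size) // stride + 1

def pool1d_valid_length(steps, pool_size):
    """Number of output steps of 1-D pooling with 'valid' padding and stride == pool_size."""
    return (steps - pool_size) // pool_size + 1

max_length = 6
after_conv = conv1d_valid_length(max_length, kernel_size=4)
after_pool = pool1d_valid_length(after_conv, pool_size=2)
print(after_conv, after_pool)
```

With `max_length = 6`, a kernel of size 4 leaves three positions and pooling by 2 leaves a single one, so a realistic `max_length` would need to be considerably larger.
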

The complete Jupyter notebooks with these examples are available on my GitHub.

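The dimension argument of feature_column.embedding_column() is exactly the dimension of the space in which tokens are represented while the model learns. See the tutorial at https://www.tensorflow.org/beta/tutorials/text/word_embeddings for a detailed explanation.
Also note that using feature_column.embedding_column() is an alternative to tf.keras.layers.Embedding(). As you can see, feature_column takes over the encoding step of the preprocessing pipeline, but you still have to split, pad and truncate the sentences manually.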
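
As a rough illustration of the encoding step and of what `dimension` controls, the following sketch mimics the two stages in plain Python: a vocabulary lookup that turns tokens into integer ids, then an embedding lookup that turns each id into a vector of length `dimension`. The table is filled with random numbers here, whereas in a real model it is a trainable weight. The names `encode` and `embed_ids`, the small vocabulary, and the use of -1 and a zero vector for unknown tokens are assumptions made for this example, not TensorFlow's actual implementation.

```python
import random

vocabulary = ["<OOV>", "a", "wiki", "is", "run"]
dimension = 8  # length of the vector that represents each token

# Vocabulary lookup: token -> integer id, -1 for tokens not in the list.
token_to_id = {token: i for i, token in enumerate(vocabulary)}

def encode(tokens):
    """Map each token to its vocabulary id (-1 if unknown)."""
    return [token_to_id.get(t, -1) for t in tokens]

# Embedding table: one row of `dimension` floats per vocabulary entry.
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(dimension)] for _ in vocabulary]

def embed_ids(ids):
    """Look up one row per id; unknown ids get a zero vector in this sketch."""
    return [table[i] if i >= 0 else [0.0] * dimension for i in ids]

ids = encode(["a", "wiki", "is", "run", "", ""])
vectors = embed_ids(ids)
print(ids)
print(len(vectors), len(vectors[0]))
```

A padded sentence of six tokens becomes six vectors of eight numbers each, so a larger `dimension` gives the model more room to describe each token at the cost of more weights to learn.
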