How do I create a text dataset in memory with Python?
Tags: python, tensorflow, keras, nlp, tensorflow2.0

How do I create a DataSet object that holds a set of words for TensorFlow text processing?
Suppose I have a list of words like this:
words = [['This', 'is', 'the', 'first'],
         ['and', 'another']]
So the number of items per training/test sample is variable.
(In practice, I fetch the text from a database and extract the relevant words with spaCy.)
I'm working from the IMDB-dataset example on tensorflow.org, which has these properties, but I'd like to switch to using my own data:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k',
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    with_info=True, as_supervised=True)

# train_data = ??? How do I make it from my own set of words/sentences?
encoder = info.features['text'].encoder

train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.shuffle(1000).padded_batch(10)

embedding_dim = 16
model = keras.Sequential([
    layers.Embedding(encoder.vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)
])
model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_batches,
    epochs=10,
    validation_data=test_batches, validation_steps=20)
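As one direct answer to the commented-out question in the code above, here is a minimal sketch (using the word lists from the question; the integer vocabulary is built by hand purely for illustration, where a real pipeline would use a tokenizer) of turning an in-memory list of variable-length samples into a padded `tf.data.Dataset` with `from_generator`:

```python
import tensorflow as tf

# Variable-length word lists from the question.
words = [['This', 'is', 'the', 'first'],
         ['and', 'another']]

# Hand-rolled vocabulary: each distinct word gets an id >= 1 (0 is padding).
vocab = {w: i + 1 for i, w in enumerate(sorted({w for s in words for w in s}))}
encoded = [[vocab[w] for w in s] for s in words]

# from_generator yields one variable-length vector per sample;
# padded_batch then zero-pads each batch to its longest sample.
ds = tf.data.Dataset.from_generator(
    lambda: iter(encoded),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32))

batch = next(iter(ds.padded_batch(2)))
print(batch.shape)  # (2, 4): the shorter sample is zero-padded
```

Because padding happens per batch, no global `maxlen` needs to be fixed up front.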
You can tokenize them with Keras and pad the sequences so they all have the same length. For example:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

X_train = ['They like my dog', 'I hate my cat', 'We will love my hamster',
           'I dislike your llama']
X_test = ['We love our hamster', 'They hate our platypus']
y_train = [1, 0, 1, 0]
y_test = [1, 0]

encoder = keras.preprocessing.text.Tokenizer()
encoder.fit_on_texts(X_train)

X_train = encoder.texts_to_sequences(X_train)
X_test = encoder.texts_to_sequences(X_test)

maxlen = max(map(len, X_train))
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)

train_batches = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(1)
test_batches = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(1)

embedding_dim = 16
model = keras.Sequential([
    layers.Embedding(len(encoder.index_word) + 1, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(24, activation='relu'),
    layers.Dense(1)
])
model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_batches, epochs=50, validation_data=test_batches)
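One caveat worth noting with this approach: the `Tokenizer` above is fitted without an `oov_token`, so test-set words it has never seen ('our', 'platypus') are silently dropped by `texts_to_sequences`, shortening those sequences. A small sketch of the fix, with hypothetical sentences:

```python
from tensorflow import keras

X_train = ['They like my dog', 'I hate my cat']

# With oov_token set, unseen words map to a dedicated index
# instead of being dropped from the sequence.
encoder = keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
encoder.fit_on_texts(X_train)

seqs = encoder.texts_to_sequences(['We hate our platypus'])
# 'we', 'our', 'platypus' were never seen, so each becomes the OOV index;
# only 'hate' keeps its own index, and the sequence stays 4 tokens long.
print(seqs)
```

Without `oov_token`, the same call would return a 1-element sequence containing only the id for 'hate'.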
Comment: Do you plan to leave feedback on my answer? If there's anything I can improve, let me know.
Comment: @NicolasGervais I've been busy with other things before getting back to this, hang tight.
Training output:
1/4 [====>......] - ETA: 0s - loss: 0.1935 - acc: 1.0000
4/4 [===========] - 5ms/step - loss: 0.212 - acc: 1.00 - val_loss: 0.416 - val_acc: 1.00
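For newer TensorFlow versions (roughly 2.6 and later), where `keras.preprocessing.text.Tokenizer` is deprecated, the `TextVectorization` layer covers tokenizing and padding in one step. A rough sketch with the same toy sentences from the answer:

```python
import tensorflow as tf

X_train = ['They like my dog', 'I hate my cat', 'We will love my hamster',
           'I dislike your llama']

# adapt() builds the vocabulary from the training texts;
# output_sequence_length pads/truncates every sample to a fixed length,
# playing the role pad_sequences played above (0 = padding, 1 = OOV).
vectorize = tf.keras.layers.TextVectorization(output_sequence_length=4)
vectorize.adapt(X_train)

ids = vectorize(['We love our hamster'])
print(ids.shape)  # (1, 4); the unseen word 'our' maps to the OOV index 1
```

Since `TextVectorization` is a layer, it can also be placed directly in front of the `Embedding` layer so the model accepts raw strings.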