How do I create a text dataset in memory with Python?

Tags: python, tensorflow, keras, nlp, tensorflow2.0

How do I create a DataSet object that holds a collection of words for text processing with TensorFlow?

Suppose I have a list of words like this:

words  = [ ['This', 'is', 'the', 'first'],
           [ 'and', 'another']
         ]
So the number of items per training/test sample is variable. (In practice I pull the text from a database and use Spacy to extract the relevant words.)
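
For what it's worth, here is a minimal sketch (assuming a recent TF 2.x release; not part of the original question) of how variable-length token lists like these can be held in a RaggedTensor and sliced into a tf.data.Dataset:

import tensorflow as tf

# Variable-length token lists held as a RaggedTensor (sketch only)
words = [['This', 'is', 'the', 'first'],
         ['and', 'another']]

ragged = tf.ragged.constant(words)                    # shape (2, None), dtype=string
dataset = tf.data.Dataset.from_tensor_slices(ragged)  # one variable-length sample per element

for sample in dataset:
    print(sample.numpy())  # e.g. [b'This' b'is' b'the' b'first']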

I'm currently following the tensorflow.org example that uses the IMDB dataset with these properties, but I'd like to switch to using my own data:

import tensorflow as tf


from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_datasets as tfds
tfds.disable_progress_bar()



# Load the IMDB reviews dataset, pre-encoded with an ~8k subword vocabulary
(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k',
    split = (tfds.Split.TRAIN, tfds.Split.TEST),
    with_info=True, as_supervised=True)


#train_data = ???  How do I make it from my own set of words/sentences (a sketch follows this code block)

encoder = info.features['text'].encoder  # subword text encoder bundled with this dataset

# padded_batch pads each batch of variable-length reviews to a common length
# (TF versions before 2.2 also require an explicit padded_shapes argument here)
train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.shuffle(1000).padded_batch(10)

embedding_dim=16

model = keras.Sequential([
  layers.Embedding(encoder.vocab_size, embedding_dim),
  layers.GlobalAveragePooling1D(),
  layers.Dense(16, activation='relu'),
  layers.Dense(1)
])

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_batches,
    epochs=10,
    validation_data=test_batches, validation_steps=20)
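
One possible way to fill in the train_data = ??? line above is to build the dataset from a Python generator, so that padded_batch can pad each batch on the fly. This is only a sketch: my_samples, my_labels and the vocab lookup below are made-up placeholders, and the Embedding layer would then take len(vocab) + 1 instead of encoder.vocab_size.

import tensorflow as tf

# Placeholder samples/labels standing in for your own database + Spacy output
my_samples = [['This', 'is', 'the', 'first'],
              ['and', 'another']]
my_labels = [1, 0]

# Trivial word -> id mapping; id 0 is kept free for padding
vocab = {w: i + 1 for i, w in enumerate(sorted({w for s in my_samples for w in s}))}

def gen():
    for tokens, label in zip(my_samples, my_labels):
        yield [vocab[t] for t in tokens], label

train_data = tf.data.Dataset.from_generator(
    gen,
    output_types=(tf.int64, tf.int64),
    output_shapes=([None], []))

# padded_batch pads each batch to the length of its longest sequence
train_batches = train_data.shuffle(1000).padded_batch(10, padded_shapes=([None], []))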


You can tokenize them with Keras and pad the sequences so they all have the same length. For example:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

X_train = ['They like my dog', 'I hate my cat', 'We will love my hamster', 
           'I dislike your llama']
X_test = ['We love our hamster', 'They hate our platypus']
y_train = [1, 0, 1, 0]
y_test = [1, 0]

# Build a word -> index vocabulary from the training texts
encoder = keras.preprocessing.text.Tokenizer()

encoder.fit_on_texts(X_train)

# Turn each sentence into a list of integer word indices
X_train = encoder.texts_to_sequences(X_train)
X_test = encoder.texts_to_sequences(X_test)

# Pad every sequence to the length of the longest training sample
maxlen = max(map(len, X_train))

X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)

# Wrap the padded arrays and labels in tf.data datasets
train_batches = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(1)
test_batches = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(1)

embedding_dim = 16

model = keras.Sequential([
  layers.Embedding(len(encoder.index_word) + 1, embedding_dim),  # +1: index 0 is reserved for padding
  layers.GlobalAveragePooling1D(),
  layers.Dense(24, activation='relu'),
  layers.Dense(1)
])

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_batches, epochs=50, validation_data=test_batches)
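
As a quick usage check (not part of the original answer), a new sentence can be run through the same fitted Tokenizer and padding before calling model.predict; because the loss uses from_logits=True, the output is a raw logit rather than a probability:

# Encode and pad a new sentence with the fitted Tokenizer, then predict
new_texts = ['We love my dog']
new_seqs = keras.preprocessing.sequence.pad_sequences(
    encoder.texts_to_sequences(new_texts), maxlen=maxlen)
print(model.predict(new_seqs))  # raw logit; > 0 leans towards the positive class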

Do you plan to give any feedback on my answer? Let me know if I can improve it. @NicolasGervais I've been busy with other things before getting back to this topic; hang tight.
1/4 [====>......] - ETA: 0s - loss: 0.1935 - acc: 1.0000
4/4 [===========] - 5ms/step - loss: 0.212 - acc: 1.00 - val_loss: 0.416 - val_acc: 1.00