Word2vec implementation in TensorFlow 2.0

tensorflow deep-learning

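I want to implement word2vec using TensorFlow 2.0. I prepared the dataset according to the skip-gram model and ended up with roughly 18 million observations (pairs of target and context words).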

For this I used the following dataset:

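I created a new dataset for the n-gram model. I used a window size of 2 and a number of skips also equal to 2 to create, for each target word (our input), the context words (which I have to predict). It looks like this: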

target  context
  1        3
  1        1
  2        1
  2     1222
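
For illustration, pairs like the ones above can be generated roughly as follows. This is a minimal sketch, not the code that was actually used to build the dataset: the function name `skipgram_pairs`, the toy `sentence` of word ids, and the fixed random seed are all assumptions made for the example.

```python
import random


def skipgram_pairs(token_ids, window_size=2, num_skips=2, seed=0):
    """Return (target, context) pairs for a sequence of word ids.

    For each target position, up to `num_skips` context words are drawn
    at random from the words at most `window_size` positions away.
    """
    rng = random.Random(seed)
    pairs = []
    for i, target in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        # All words inside the window except the target itself.
        candidates = [token_ids[j] for j in range(lo, hi) if j != i]
        for context in rng.sample(candidates, min(num_skips, len(candidates))):
            pairs.append((target, context))
    return pairs


sentence = [5, 8, 2, 9, 4]  # hypothetical word ids
pairs = skipgram_pairs(sentence)
for target, context in pairs:
    print(target, context)
```

With a window size of 2 and 2 skips, every target word contributes two rows, which is why the number of observations grows to about twice the corpus length.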

Here is my code:

import tensorflow as tf

dataset_train = tf.data.Dataset.from_tensor_slices((target, context))
dataset_train = dataset_train.shuffle(buffer_size=1024).batch(64)

#Parameters:
num_words = len(word_index)#approximately 100000
embed_size = 300
num_sampled = 64
initializer_softmax = tf.keras.initializers.GlorotUniform()
#Variables:
embeddings_weight = tf.Variable(tf.random.uniform([num_words,embed_size],-1.0,1.0))
softmax_weight = tf.Variable(initializer_softmax([num_words,embed_size]))
softmax_bias = tf.Variable(initializer_softmax([num_words]))

optimizer = tf.keras.optimizers.Adam()


#One training step: look up the embeddings of the target words X and
#minimize the sampled softmax loss against the context words y.
@tf.function
def training(X,y):
    with tf.GradientTape() as tape:
        embed = tf.nn.embedding_lookup(embeddings_weight,X)
        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights = softmax_weight, biases = softmax_bias, inputs = embed,
                                   labels = y, num_sampled = num_sampled, num_classes = num_words))
    variables = [embeddings_weight,softmax_weight,softmax_bias]  
    gradients = tape.gradient(loss,variables)
    optimizer.apply_gradients(zip(gradients,variables))


EPOCHS = 30
for epoch in range(EPOCHS):
    print('Epoch:',epoch)
    for X,y in dataset_train:
        training(X,y)  


#compute similarity of words: 
norm =  tf.sqrt(tf.reduce_sum(tf.square(embeddings_weight), 1, keepdims=True))
norm_embed = embeddings_weight/ norm
temp_emb = tf.nn.embedding_lookup(norm_embed,X)
similarity = tf.matmul(temp_emb,tf.transpose(norm_embed))
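
The similarity step at the end of the code above is a cosine similarity: each embedding row is scaled to unit length, so the matrix product of the selected rows with all rows gives the cosine of the angle between word vectors. A minimal pure-Python sketch of the same idea on a toy matrix follows. The helper names `normalize` and `most_similar` and the values in `toy_embeddings` are assumptions for illustration only.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def most_similar(embeddings, word_id, top_k=2):
    """Return the ids of the top_k rows most similar to row `word_id`."""
    unit = [normalize(v) for v in embeddings]
    query = unit[word_id]
    # The dot product of unit vectors equals their cosine similarity.
    scores = [
        (sum(a * b for a, b in zip(query, row)), idx)
        for idx, row in enumerate(unit)
        if idx != word_id
    ]
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:top_k]]


toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = most_similar(toy_embeddings, 0, top_k=2)
print(nearest)
```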

But even a single epoch takes far too long to compute. Is there any way to improve the performance of this code? (I am running it on Google Colab.)

Edit: this is the shape of my training dataset:

dataset_train

<BatchDataset shapes: ((None,), (None, 1)), types: (tf.int64, tf.int64)>

I followed the instructions in this guide:

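This is because the softmax function is computationally quite expensive when it has to handle the millions of possible outputs in the Word2Vec algorithm, as explained. Negative sampling can speed up training.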
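
To make the cost difference concrete, here is a minimal pure-Python sketch for a single (target, context) pair. It is not the code from the question, which uses `tf.nn.sampled_softmax_loss`, a related sampling-based loss. The function names, the toy sizes, and the uniform choice of negative words are assumptions for illustration. The full softmax computes one dot product per vocabulary word, whereas negative sampling computes only `num_sampled + 1`.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def full_softmax_loss(hidden, out_weights, true_id):
    """Cross-entropy over the whole vocabulary: one dot product per word."""
    logits = [dot(hidden, w) for w in out_weights]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[true_id]


def negative_sampling_loss(hidden, out_weights, true_id, negative_ids):
    """Binary logistic loss on the true word plus a few sampled negatives."""
    loss = -log_sigmoid(dot(hidden, out_weights[true_id]))
    for n in negative_ids:
        loss -= log_sigmoid(-dot(hidden, out_weights[n]))
    return loss


rng = random.Random(0)
vocab_size, embed_size, num_sampled = 1000, 8, 5  # toy sizes (assumed)
out_weights = [[rng.uniform(-0.5, 0.5) for _ in range(embed_size)]
               for _ in range(vocab_size)]
hidden = [rng.uniform(-0.5, 0.5) for _ in range(embed_size)]
true_id = 42

# Assumption: negatives drawn uniformly, skipping the true word.
drawn = rng.sample(range(vocab_size), num_sampled + 1)
negative_ids = [i for i in drawn if i != true_id][:num_sampled]

full = full_softmax_loss(hidden, out_weights, true_id)        # 1000 dot products
sampled = negative_sampling_loss(hidden, out_weights, true_id,
                                 negative_ids)                # 6 dot products
print(full, sampled)
```

The two losses are different objectives, so their values are not directly comparable. The point of the sketch is only the amount of work done per training pair.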