What is the difference between the two different implementations of Bahdanau's attention given in TensorFlow's official tutorials?

Tags: tensorflow, keras, deep-learning, tensorflow2.0, attention-model

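I am reading and writing code for a machine translation task, and I got stuck on two different tutorials.

One of them is an implementation of a paper in which image features of shape [64, 2048] are used, so that each image is treated as a sentence of 64 words, where each word has an embedding of length 2048. I completely understand this implementation. Below is the code for Bahdanau's additive-style attention:
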
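import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, features, hidden):
    # features: (batch, 64, feature_dim); hidden: (batch, hidden_size)
    # Add a time axis so hidden broadcasts over the 64 locations: (batch, 1, hidden_size)
    hidden_with_time_axis = tf.expand_dims(hidden, 1)
    # Shape: (batch, 64, units)
    attention_hidden_layer = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

    # One unnormalized score per location: (batch, 64, 1)
    score = self.V(attention_hidden_layer)

    # Normalize over the 64 locations: (batch, 64, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # Weighted sum of the features: (batch, feature_dim)
    context_vector = attention_weights * features
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights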
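
For reference, this is my reading of the first version written out as equations. It is only a sketch: the symbols $f_i$ (the i-th of the 64 feature vectors), $h$ (the decoder hidden state), and the bias terms $b_1$, $b_2$, $b_v$ of the three `Dense` layers are my own notation, not the tutorial's.

```latex
\begin{aligned}
e_i &= v^\top \tanh\left(W_1 f_i + b_1 + W_2 h + b_2\right) + b_v, \quad i = 1, \dots, 64 \\
\alpha_i &= \frac{\exp(e_i)}{\sum_{j=1}^{64} \exp(e_j)} \\
c &= \sum_{i=1}^{64} \alpha_i f_i
\end{aligned}
```

Since $b_v$ is the same for every $i$, it cancels in the softmax and does not affect the weights $\alpha_i$.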

But when I went to the other tutorial, I found this more complex version there, and I cannot understand what is happening in it:
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super().__init__()
    # Projections for the query and the keys; no bias terms here
    self.W1 = tf.keras.layers.Dense(units, use_bias=False)
    self.W2 = tf.keras.layers.Dense(units, use_bias=False)

    self.attention = tf.keras.layers.AdditiveAttention()

  def call(self, query, value, mask):
    w1_query = self.W1(query)
    w2_key = self.W2(value)

    # Every query position is kept; only the value positions are masked
    query_mask = tf.ones(tf.shape(query)[:-1], dtype=bool)
    value_mask = mask

    context_vector, attention_weights = self.attention(
        inputs=[w1_query, value, w2_key],
        mask=[query_mask, value_mask],
        return_attention_scores=True,
    )
    return context_vector, attention_weights
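
To make the comparison concrete, here is a rough sketch in plain Python (not TensorFlow) of what the second version appears to compute for a single query step. As far as I understand the documented behaviour of `tf.keras.layers.AdditiveAttention`, it scores each query/key pair as `sum(scale * tanh(query + key))`, which has the same form as `V(tanh(W1(...) + W2(...)))` in the first version. The function name `additive_attention`, the toy numbers, and the use of `-inf` to exclude padded positions are my own assumptions for illustration, not code from either tutorial.

```python
import math

def additive_attention(query, keys, values, v, mask):
    """Additive attention for one query vector (hypothetical helper).

    query:  list of length units, already projected (like W1(query))
    keys:   T lists of length units, already projected (like W2(value))
    values: T lists, the unprojected encoder outputs
    v:      list of length units (role of Dense(1) or the layer's scale)
    mask:   T booleans; False marks a padded position (at least one True)
    """
    scores = []
    for key, keep in zip(keys, mask):
        s = sum(v_d * math.tanh(q_d + k_d)
                for v_d, q_d, k_d in zip(v, query, key))
        # Assumption: a masked position gets -inf so its weight becomes 0
        scores.append(s if keep else float("-inf"))

    # Numerically stable softmax over the T positions
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: weighted sum of the unprojected values
    dim = len(values[0])
    context = [sum(w * val[d] for w, val in zip(weights, values))
               for d in range(dim)]
    return context, weights

# Toy example: three positions, the last one is padding
query = [0.1, -0.2]
keys = [[0.3, 0.0], [0.5, 0.4], [9.0, 9.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
v = [1.0, 1.0]
mask = [True, True, False]
context, weights = additive_attention(query, keys, values, v, mask)
```

With every mask entry set to `True`, this reduces to the score, softmax, and weighted sum of the first version without its bias terms. On this reading, the visible differences are the missing biases, the padding mask, and that the second version takes a `query` with its own time axis rather than a single hidden state.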

I want to ask:
What is the difference between the two?
Why can't we use the second version's code for caption generation, and vice versa?
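Does this answer your question?

Do you mean that we can use the two interchangeably in both tasks, as long as I pass inputs and outputs of the same shape?

They look like two different implementations of the same thing, with subtle differences between them. Looking at the code, I don't think there would be any real difference in performance. However, to get the exact, correct implementation, you would have to read the paper that introduced Bahdanau attention.

@Sustimagrawal Do they use the score, the current word, and the hidden decoder_state_t-1 to produce decoder_state_t, and then use this word, the score, and this new_decoder_state_t again to produce the vector? I think that is what they are doing in the second code. Can you comment on that?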