Python: positional encoding causes poor convergence in language modeling
This is a hard question to answer, but I'll try anyway. I'm implementing the language-modeling architecture from this paper. See the diagram on page 2 and the section on positional or "temporal" encoding at the top of page 5. More on positional encoding is at the bottom of page 5 / top of page 6. (The author of the first paper pointed me to the second one.) In short, here is my Keras implementation:
# Imports for this snippet (Keras 2.x). SEQ_LEN, EMBED_DIM, VOCAB_SIZE, NUM_LAYERS,
# DROPOUT_R, BATCH_SIZE and the training data/callbacks are defined elsewhere.
from keras.layers import Input, Embedding, Dense, Dropout, Dot, Softmax, Add, Layer
from keras.models import Model
from keras.initializers import RandomNormal
from keras.optimizers import SGD

word_seq = Input(shape = (SEQ_LEN,), dtype = "int32", name = "word_seq")
query = Input(shape = (EMBED_DIM, ), dtype = "float32", name = "q_input")
# the query for lang. modeling is a constant vector filled with 0.1, as described at the bottom of page 7 in the first linked paper

T_A = Added_Weights(input_dim = (SEQ_LEN, EMBED_DIM))
# Added_Weights is a custom layer I wrote, which I'll post below
# These are the "positional encoding" components
T_C = Added_Weights(input_dim = (SEQ_LEN, EMBED_DIM))

Emb_A = Embedding(output_dim = EMBED_DIM, input_dim = VOCAB_SIZE, input_length = SEQ_LEN, name = "Emb_A")
Emb_C = Embedding(output_dim = EMBED_DIM, input_dim = VOCAB_SIZE, input_length = SEQ_LEN, name = "Emb_C")

int_state_weights = Dense(units = EMBED_DIM, activation = 'linear',
                          kernel_initializer = RandomNormal(mean = 0., stddev = 0.05, seed = None))
layer_output = query
# the loop uses the output from the previous layer as the query, but the first layer's query is just that constant vector
for i in range(0, NUM_LAYERS - 1):
    memories = Emb_A(word_seq)  # these all re-use the weights instantiated earlier
    memories = T_A(memories)
    memories = Dropout(DROPOUT_R)(memories)
    content = Emb_C(word_seq)
    content = T_C(content)
    mem_relevance = Dot(axes = [1, 2])([layer_output, memories])
    weighted_internal_state = int_state_weights(mem_relevance)
    mem_relevance = Softmax()(mem_relevance)
    content_relevance = Dot(axes = 1)([mem_relevance, content])  # weight each piece of content by its probability of being relevant
    layer_output = Add()([content_relevance, weighted_internal_state])
    layer_output = Dropout(DROPOUT_R)(layer_output)
final_output = Dense(units = VOCAB_SIZE, activation = 'relu',
                     kernel_initializer = RandomNormal(mean = 0., stddev = 0.05, seed = None))(layer_output)

model = Model(inputs = [word_seq, query], outputs = final_output)
model.compile(optimizer = SGD(lr = 0.01, clipnorm = 50.), loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(x = [td_seqs, td_query], y = [td_labels],
          batch_size = BATCH_SIZE, callbacks = [lr_adjust, lr_termination, for_csv], epochs = 200, verbose = 1)
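For reference, the two Dot steps inside the loop reduce to batched tensor contractions. This is a NumPy sketch of the shapes only (the variable names mirror the post, but the dimensions here are made up for illustration):

```python
import numpy as np

BATCH, SEQ_LEN, EMBED_DIM = 2, 6, 4

layer_output = np.random.rand(BATCH, EMBED_DIM)        # the current query
memories = np.random.rand(BATCH, SEQ_LEN, EMBED_DIM)   # Emb_A(word_seq) after T_A
content = np.random.rand(BATCH, SEQ_LEN, EMBED_DIM)    # Emb_C(word_seq) after T_C

# Dot(axes=[1, 2]): contract the query's embedding axis (1) with the
# memories' embedding axis (2), leaving one score per memory slot
mem_relevance = np.einsum('be,bse->bs', layer_output, memories)  # (BATCH, SEQ_LEN)

# Softmax over the sequence axis turns the scores into probabilities
probs = np.exp(mem_relevance)
probs /= probs.sum(axis = 1, keepdims = True)

# Dot(axes=1): contract over the sequence axis, yielding one weighted
# content vector per example
content_relevance = np.einsum('bs,bse->be', probs, content)      # (BATCH, EMBED_DIM)
```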
The batch size is currently 128. This worked fine on 35,000 training samples before I added the T_A and T_C pieces, finishing at 96% accuracy. Once I implemented T_A and T_C (the positional encoding), training ended at around 10% accuracy with a training loss of 5.2. I increased the training data tenfold and saw no real improvement. Here is my Added_Weights class:
class Added_Weights(Layer):
    def __init__(self, input_dim, **kwargs):
        super(Added_Weights, self).__init__(**kwargs)
        self.input_dim = input_dim

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name = 'kernel',
                                      shape = (self.input_dim[0], self.input_dim[1]),
                                      initializer = RandomNormal(mean = 0., stddev = 0.05, seed = None),
                                      trainable = True)
        super(Added_Weights, self).build(input_shape)

    def call(self, x, **kwargs):
        return x + self.kernel

    def compute_output_shape(self, input_shape):
        return input_shape
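The layer's call boils down to a single broadcasted addition: the same trainable (SEQ_LEN, EMBED_DIM) matrix is added to every example in the batch. A NumPy sketch of that behaviour (the shapes here are illustrative, not from the post):

```python
import numpy as np

BATCH, SEQ_LEN, EMBED_DIM = 3, 6, 4

x = np.random.rand(BATCH, SEQ_LEN, EMBED_DIM)  # layer input, one row per batch example
kernel = np.random.rand(SEQ_LEN, EMBED_DIM)    # trainable position matrix, shared across the batch

# broadcasts kernel over the batch axis, exactly as Added_Weights.call does;
# the output shape is unchanged, matching compute_output_shape
out = x + kernel
```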
Having read both of these excellent papers, which clearly state that this should work, I'm at a loss as to why it doesn't. If anyone can help, that would be amazing.

I don't see how this implements the positional encoding described in the papers. You're just adding weights. That may well be the problem, but what exactly do the papers do differently? Both of them sound like they're adding weights too.
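For contrast with the trainable T_A / T_C matrices above, the "position encoding" (PE) scheme in the memory-networks paper is, as I read it, a fixed non-trainable matrix l with entries l[k][j] = (1 - j/J) - (k/d)(1 - 2j/J), where j runs over the J word positions and k over the d embedding dimensions, and it is applied element-wise (multiplicatively) to the embeddings rather than added. A sketch of that matrix, under my reading of the paper's 1-based indexing:

```python
import numpy as np

def position_encoding(J, d):
    """Fixed PE matrix: l[k, j] = (1 - j/J) - (k/d) * (1 - 2*j/J), with 1-based j, k."""
    j = np.arange(1, J + 1)            # word positions, shape (J,)
    k = np.arange(1, d + 1)[:, None]   # embedding dimensions, shape (d, 1)
    return (1.0 - j / J) - (k / d) * (1.0 - 2.0 * j / J)

l = position_encoding(J = 4, d = 2)    # shape (d, J) = (2, 4)
```

For example, at j = 1, k = 1 with J = 4, d = 2 this gives (1 - 1/4) - (1/2)(1 - 2/4) = 0.5.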