Deep learning: Keras generative LSTM only predicts stop words


I created a model in Keras that uses an LSTM to predict the next word given a sequence of words. Here is my code:

# Small LSTM Network to Generate Text for Alice in Wonderland
import re
import numpy
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()
print raw_text
# strip punctuation, keeping only lowercase letters, spaces and apostrophes
raw_text = re.sub(r'[^\w\s]', '', raw_text)
raw_text = re.sub(r'[^a-z \']+', " ", raw_text)
# create mapping of unique words to integers
words_unsorted=list(raw_text.split())
words= sorted(list(set(raw_text.split())))
word_to_int = dict((w, i) for i, w in enumerate(words))
int_to_word = dict((i, w) for i, w in enumerate(words))
#print word_to_int

n_words = len(words_unsorted)
n_vocab = len(words)
print "Total Words: ", n_words
print "Total Vocab: ", n_vocab

# prepare the dataset of input to output pairs encoded as integers
seq_length = 7
dataX = []
dataY = []
for i in range(0, n_words - seq_length, 1):
    seq_in = words_unsorted[i:i + seq_length]
    seq_out = words_unsorted[i + seq_length]
    #print seq_in
    dataX.append([word_to_int[word] for word in seq_in])
    dataY.append(word_to_int[seq_out])


n_patterns = len(dataX)
print "Total Patterns: ", n_patterns

# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
print X[0]
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
print model.summary()
# define the checkpoint
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=50, batch_size=128, callbacks=callbacks_list)

The problem is that when I predict on a test sentence, I always get "and" as the next-word prediction! Should I remove all the stop words, or do something else? Also, I am training it for 20 epochs.

Given the age of this post, I am fairly sure you have already solved your problem. But just in case, here are my two cents.

You end up predicting the most frequent word. So if you remove the stop words, you will just predict the next most frequent word. There are two ways I know of to deal with this.

First, you can use a loss that emphasizes the less frequent classes (words, in your case). Focal loss is one example of this, and conveniently there is a Keras implementation of it.
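A minimal sketch of what a categorical focal loss could look like (a hypothetical helper written against the Keras backend, not the specific implementation the answer refers to; gamma and alpha are the usual focal-loss hyperparameters):

# Sketch of a categorical focal loss for Keras (hypothetical helper).
# gamma down-weights easy examples, alpha scales the overall contribution.
from keras import backend as K

def focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        # clip to avoid log(0)
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        cross_entropy = -y_true * K.log(y_pred)
        # (1 - p)^gamma shrinks the loss for well-classified (frequent) words
        weight = alpha * K.pow(1.0 - y_pred, gamma)
        return K.sum(weight * cross_entropy, axis=-1)
    return loss

# usage instead of plain categorical_crossentropy:
# model.compile(loss=focal_loss(gamma=2.0, alpha=0.25), optimizer='adam')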

The other way is to use class_weight in the fit function:

model.fit(X, y, epochs=50, batch_size=128, callbacks=callbacks_list, class_weight=class_weight)

You can set the weights for less/more frequent words yourself, e.g. inversely proportional to their frequency, as sketched below.
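A rough sketch of building such weights, assuming dataY from the question (the integer-encoded target words); the resulting dict is then passed as class_weight in the fit call shown above:

# Class weights inversely proportional to word frequency
# (assumes dataY holds the integer-encoded target words from the question).
from collections import Counter

counts = Counter(dataY)
total = float(len(dataY))
class_weight = {word_id: total / (len(counts) * c) for word_id, c in counts.items()}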

You should try a stateful LSTM first. @MarcinMożejko I tried a stateful LSTM with batch_size=27, but again I get a stop word as the only prediction for any input sequence!
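For reference, a minimal sketch of what the stateful variant of the question's model could look like (an assumption, not the asker's exact code; with stateful=True the batch size must divide the number of samples evenly and shuffling must be disabled):

# Stateful variant of the model (sketch). batch_input_shape fixes the batch
# size so the LSTM can carry its state from one batch to the next.
batch_size = 27
model = Sequential()
model.add(LSTM(256, batch_input_shape=(batch_size, seq_length, 1),
               stateful=True, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, stateful=True))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=50, batch_size=batch_size, shuffle=False,
          callbacks=callbacks_list)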