Why does my LSTM model reach a val_acc of 1.0 after only the first epoch?


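I am using an LSTM to generate news headlines. It should predict the next character based on the previous characters in the sequence. I have a file with over a million news headlines, but for speed I chose to work with 100,000 randomly selected ones.

When I try to train my model, it reaches a validation accuracy of 1.0 and a training accuracy of 0.9986 in the first epoch. That cannot be right. I don't think a lack of data is the problem, since 90,000 training samples should be enough. This also looks like more than ordinary overfitting. Each epoch seems to take too long as well, about 2.5 minutes, but I have never used LSTMs before, so I am not sure what to expect in terms of training time. What is causing my model to behave like this?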

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Import Libraries Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
import csv
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense  
import datetime
import matplotlib.pyplot as plt

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Load Data Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
headlinesFull = []
with open("abcnews-date-text.csv", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for lines in csv_reader:
        headlinesFull.append(lines['headline_text'])

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Pretreat Data Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# shuffle and select 100000 headlines
np.random.shuffle(headlinesFull)
headlines = headlinesFull[:100000]

# add spaces to ensure each headline is the same length as the longest headline
max_len = max(map(len, headlines))
headlines = [i + " "*(max_len-len(i)) for i in headlines]

# integer encode sequences of characters (char_level=True)
# create the tokenizer
t = Tokenizer(char_level=True)
# fit the tokenizer on the headlines
t.fit_on_texts(headlines)
sequences = t.texts_to_sequences(headlines)

# vocabulary size
vocab_size = len(t.word_index) + 1

# separate into input and output
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]     
y = to_categorical(y, num_classes=vocab_size)
seq_len = X.shape[1]

# split data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Define Model Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_len))
model.add(LSTM(100, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()  # summary() prints the table itself and returns None
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Train Model Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# fit model
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=128, epochs=1)

Train on 90000 samples, validate on 10000 samples
Epoch 1/1
90000/90000 [==============================] - 161s 2ms/step - loss: 0.0493 - acc: 0.9986 - val_loss: 2.3842e-07 - val_acc: 1.0000

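Looking at the code, I can infer the following.

You are using spaces as padding to bring every headline up to the maximum headline length: headlines = [i + " "*(max_len-len(i)) for i in headlines]. The headlines are converted to sequences, and the input/output split happens only after all headlines have been padded to the maximum length. As a result, for most samples the last character, which is the output (the last number in the sequence), is the same padding character. That is why you get such a high accuracy even after a single epoch. Solution: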

You can add the padding at the beginning of each headline instead of appending it at the end:

headlines = [" "*(max_len-len(i)) + i for i in headlines]
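The effect of the padding side can be checked without training anything. The sketch below is only an illustration: the four headlines and the helper `majority_target_share` are made up for this example and are not part of the original code. It measures how often the most common final character occurs, which is the accuracy a model would get by always predicting that one character.

```python
from collections import Counter

# Hypothetical stand-in headlines; the real ones come from abcnews-date-text.csv
headlines = [
    "police probe fatal crash",
    "rain eases drought fears",
    "council approves new budget plan",
    "storm damages homes",
]
max_len = max(map(len, headlines))


def majority_target_share(padded):
    """Share of sequences whose last character is the most common last character."""
    last_chars = Counter(s[-1] for s in padded)
    return last_chars.most_common(1)[0][1] / len(padded)


# Padding at the end: every headline shorter than max_len ends in a space
right_padded = [h + " " * (max_len - len(h)) for h in headlines]
# Padding at the start: the last character is always a real character
left_padded = [" " * (max_len - len(h)) + h for h in headlines]

right_share = majority_target_share(right_padded)
left_share = majority_target_share(left_padded)
print(right_share, left_share)
```

With 100,000 real headlines, only those that already have the maximum length end in something other than a space under end padding, so the share for `right_padded` should be very close to 1.0, in line with the reported val_acc.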

Alternatively, split the headlines into X and y first, and only then add padding to the end of each input.
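A minimal sketch of this second option, working on raw strings before tokenization. The sample headlines and the names `inputs`, `targets`, and `padded_inputs` are assumptions made for illustration; in the original script the padded inputs and the targets would still have to be integer-encoded with the fitted tokenizer afterwards.

```python
# Hypothetical stand-in headlines; empty strings are dropped because they have no target
headlines = [
    "police probe fatal crash",
    "rain eases drought fears",
    "council approves new budget plan",
    "storm damages homes",
]
headlines = [h for h in headlines if h]

# Split first: everything but the last character is the input,
# the last real character is the target
inputs = [h[:-1] for h in headlines]
targets = [h[-1] for h in headlines]

# Pad only the inputs, at the end, so the targets never contain padding
max_input_len = max(map(len, inputs))
padded_inputs = [x + " " * (max_input_len - len(x)) for x in inputs]

print(targets)
```

Compared with padding at the beginning, this keeps the real characters at the start of each input, but inputs and targets then have to be encoded in two separate steps.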

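I decided to add the padding at the beginning, since that way the headlines only need to be tokenized once, and it solved the problem. Is the only reason this works better that there is now additional uncertainty about the position of the first non-space character? Training is still very slow. Are LSTMs inherently more time-consuming than other kinds of neural networks because of the recurrence?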