Python: Predicting with a trained LSTM model

python tensorflow machine-learning keras

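Using some data I collected, I trained a model with an LSTM. I want it to classify text as canine or feline.

I am trying to predict the class of a string of text like this: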

json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)

# load weights into new model
loaded_model.load_weights("lstm.hd5")
print("Loaded model from disk")


text_to_predict = ['A 2‐year‐old male domestic shorthair cat was presented for a progressive history of abnormal posture, behavior, and mentation. Menace response was absent bilaterally, and generalized tremors were identified on neurological examination. A neuroanatomical diagnosis of diffuse brain dysfunction was made. A neurodegenerative disorder was suspected. Magnetic resonance imaging findings further supported the clinical suspicion. Whole‐genome sequencing of the affected cat with filtering of variants against a database of unaffected cats was performed. Candidate variants were confirmed by Sanger sequencing followed by genotyping of a control population. Two homozygous private (unique to individual or families and therefore absent from the breed‐matched controlled population) protein‐changing variants in the major facilitator superfamily domain 8 (MFSD8) gene, a known candidate gene for neuronal ceroid lipofuscinosis type 7 (CLN7), were identified. The affected cat was homozygous for the alternative allele at both variants. This is the first report of a pathogenic alteration of the MFSD8 gene in a cat strongly suspected to have CLN7.']




MAX_SEQUENCE_LENGTH = 352
MAX_NB_WORDS = 2000

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ')
seq = tokenizer.texts_to_sequences(text_to_predict)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = loaded_model.predict(padded)
labels = ['canine', 'feline']
print(pred, labels[np.argmax(pred)])

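However, no matter which string I choose to classify, the prediction is always the same:

[[0.5212073 0.47879276]] canine

I am also not sure why I have to set MAX_SEQUENCE_LENGTH to 352; my model seems to expect an array of exactly that size. Setting it to any other value raises an error: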

ValueError: Error when checking input: expected embedding_1_input to have shape (352,) but got array with shape (250,)

For reference, my model was trained with this code:

data = pd.read_csv('data.csv')
data['Text'] = data['Text'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x)))

MAX_NB_WORDS = 2000
embed_dim = 128
lstm_out = 196

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ')
tokenizer.fit_on_texts(data['Text'].values)
X = tokenizer.texts_to_sequences(data['Text'].values)
X = pad_sequences(X)


model = Sequential()
model.add(Embedding(MAX_NB_WORDS, embed_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

print('model string has been saved')

Y =  data[['canine','feline']]
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

batch_size = 32
model.fit(X_train, Y_train, epochs = 30, batch_size=batch_size, verbose = 2)

#save model for future use.
model.save('lstm.hd5')

Thanks a lot for your help :D

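From your question, I understand that the model predicts correctly after training, but after loading the saved model it predicts the same class every time.

I recently ran into the same problem. The solution is to save the tokenizer in a pickle file and, after loading the saved model, load that pickle file when you want to run predictions.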

Code for saving the tokenizer in a pickle file:
import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

Code for loading the pickle file:
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer2 = pickle.load(handle)
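
To see why the saved tokenizer matters: a Tokenizer created at prediction time has never had fit_on_texts called on it, so its vocabulary is empty, every word is dropped, and every input ends up as the same all-zero padded vector. That would explain the identical predictions. The following is a minimal sketch of that effect that uses a plain dictionary as a stand-in for the tokenizer's word index; `fit_vocab` and `encode` are hypothetical helpers written for illustration, not Keras functions.

```python
import pickle


def fit_vocab(texts, num_words):
    """Map each word to an index by descending frequency (1 = most frequent)."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # Keep only indices below num_words, i.e. at most num_words - 1 words.
    return {w: i + 1 for i, w in enumerate(ranked[: num_words - 1])}


def encode(text, word_index):
    """Drop unknown words, as a tokenizer without an OOV token would."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


train_texts = ["the dog barked", "the cat purred"]
trained_index = fit_vocab(train_texts, num_words=2000)

# Round trip through pickle, standing in for saving/loading the tokenizer.
restored_index = pickle.loads(pickle.dumps(trained_index))

# An unfitted tokenizer knows no words at all.
fresh_index = {}

cat_restored = encode("the cat purred", restored_index)
dog_restored = encode("the dog barked", restored_index)
cat_fresh = encode("the cat purred", fresh_index)
dog_fresh = encode("the dog barked", fresh_index)

print(cat_restored, dog_restored)  # distinct index sequences
print(cat_fresh, dog_fresh)        # both empty, so both pad to all zeros
```

With the restored vocabulary the two texts encode differently; with the empty one they are indistinguishable, so the model receives the same input regardless of the text.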

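Apart from the code above, here are some other observations about your code:
Use the same padding when training the model and when predicting with the loaded model. So you can change the code from
X = pad_sequences(X)
to
X = pad_sequences(X, maxlen = MAX_SEQUENCE_LENGTH)
The values of MAX_SEQUENCE_LENGTH and MAX_NB_WORDS should be the same before and after loading the model.
Apply the same data preprocessing steps before and after loading the model. So you can also apply the function (lambda x: re.sub('[^a-zA-z0-9\s]','',x)) after loading the model.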
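
The padding point also explains the 352 in the question. When pad_sequences is called without maxlen, the longest sequence in that batch sets the width, and the Embedding layer's input_length = X.shape[1] then fixes that width into the model (presumably 352 tokens for the longest training text). Below is a rough sketch of the default behaviour (pad and truncate on the left); `pad_pre` is a hypothetical pure-Python helper for illustration, not the Keras implementation.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length.

    If maxlen is None, the longest sequence in this batch sets the length.
    """
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for seq in sequences:
        # Keep the last maxlen items; guard maxlen == 0, since seq[-0:] is everything.
        kept = list(seq)[-maxlen:] if maxlen else []
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


train_batch = [[4, 7], [1, 2, 3, 9, 5]]
single_text = [[8, 6, 2]]

train_padded = pad_pre(train_batch)           # width 5, set by the longest sequence
single_auto = pad_pre(single_text)            # width 3, does not match the training width
single_fixed = pad_pre(single_text, maxlen=5)  # width 5, matches training

print(len(train_padded[0]), len(single_auto[0]), len(single_fixed[0]))
```

Passing an explicit maxlen in both places removes the dependence on whichever batch happens to be padded.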

The working code is given below:
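# Imports assume the Keras 2.x API used in the question.
import pickle  # IMPORTANT STEP
import re

import pandas as pd
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

data = pd.read_csv('data.csv')
data['Text'] = data['Text'].apply((lambda x: re.sub(r'[^a-zA-z0-9\s]', '', x)))

MAX_SEQUENCE_LENGTH = 352  # must match the value used at prediction time
MAX_NB_WORDS = 2000
embed_dim = 128
lstm_out = 196

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ')
tokenizer.fit_on_texts(data['Text'].values)

# Save the fitted tokenizer so prediction can reuse the same vocabulary
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

X = tokenizer.texts_to_sequences(data['Text'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)  # Change Number 2

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, embed_dim, input_length=X.shape[1]))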
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

print('model string has been saved')

Y =  data[['canine','feline']]
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

batch_size = 32
model.fit(X_train, Y_train, epochs = 30, batch_size=batch_size, verbose = 2)

#save model for future use.
model.save('lstm.hd5')

The modified code for loading the model is shown below:
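# Imports assume the Keras 2.x API used in the question.
import pickle
import re

import numpy as np
from keras.models import model_from_json
from keras.preprocessing.sequence import pad_sequences

# Rebuild the architecture from the saved JSON
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)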

# load weights into new model
loaded_model.load_weights("lstm.hd5")
print("Loaded model from disk")


text_to_predict = ['A 2‐year‐old male domestic shorthair cat was presented for a progressive history of abnormal posture, behavior, and mentation. Menace response was absent bilaterally, and generalized tremors were identified on neurological examination. A neuroanatomical diagnosis of diffuse brain dysfunction was made. A neurodegenerative disorder was suspected. Magnetic resonance imaging findings further supported the clinical suspicion. Whole‐genome sequencing of the affected cat with filtering of variants against a database of unaffected cats was performed. Candidate variants were confirmed by Sanger sequencing followed by genotyping of a control population. Two homozygous private (unique to individual or families and therefore absent from the breed‐matched controlled population) protein‐changing variants in the major facilitator superfamily domain 8 (MFSD8) gene, a known candidate gene for neuronal ceroid lipofuscinosis type 7 (CLN7), were identified. The affected cat was homozygous for the alternative allele at both variants. This is the first report of a pathogenic alteration of the MFSD8 gene in a cat strongly suspected to have CLN7.']

# text_to_predict is a plain list, so apply the same cleaning with a comprehension
text_to_predict = [re.sub(r'[^a-zA-z0-9\s]', '', x) for x in text_to_predict]  # CHANGE 3

MAX_SEQUENCE_LENGTH = 352
MAX_NB_WORDS = 2000

# Loading the Pickle File ==> IMPORTANT STEP
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer2 = pickle.load(handle)

# tokenizer = Tokenizer(num_words=MAX_NB_WORDS, split=' ') # THIS IS NOT REQUIRED
seq = tokenizer2.texts_to_sequences(text_to_predict)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = loaded_model.predict(padded)
labels = ['canine', 'feline']
print(pred, labels[np.argmax(pred)])

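If these changes do not give you the desired output, please let me know and I will be happy to help.
Hope this helps. Happy learning!

You need the tokenizer that you used to train your model. With it, you would pass in the data in the form np.array([tokenizer.encode('which string input')]).
It says the tokenizer object has no attribute encode.
That is probably because you are not using the TensorFlow tokenizer: import tensorflow_datasets as tfds; tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(clean_data, target_vocab_size)
I'll try your approach on PCI, but I still don't understand your response. I am using the same tokenizer that was used when building the LSTM model. However, it does not produce something with the same shape. When I evaluate the model, I get an int32 array of length 352. When I try it on a string, it becomes an array the length of the string instead of being padded to 352?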
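
One possible cause of the shape problem in the last comment, offered as a guess since the commenter's code is not shown: texts_to_sequences iterates over its argument, so passing a bare string instead of a list of strings makes each character its own sample. The sketch below imitates that with a hypothetical pure-Python `texts_to_sequences` and a made-up `word_index`; it is not the Keras implementation.

```python
def texts_to_sequences(texts, word_index):
    """Encode each element of `texts` as a list of known word indices."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


word_index = {"the": 1, "cat": 2, "sat": 3}  # made-up vocabulary
sentence = "the cat sat"

as_list = texts_to_sequences([sentence], word_index)  # one sample
as_str = texts_to_sequences(sentence, word_index)     # one "sample" per character

print(len(as_list), len(as_str), len(sentence))
```

Wrapping the text in a list, as text_to_predict does in the answer, keeps it a single sample that pads to one row of length MAX_SEQUENCE_LENGTH.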