LSTM autoencoder in Python

python machine-learning tensorflow deep-learning keras


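I am trying to build an LSTM autoencoder whose goal is to obtain, from a sequence, a fixed-size vector that represents the sequence as well as possible. The autoencoder consists of two parts:

`LSTM` encoder: takes a sequence and returns an output vector (`return_sequences=False`)

`LSTM` decoder: takes an output vector and returns a sequence (`return_sequences=True`)

So, in the end, the encoder is a many-to-one LSTM and the decoder is a one-to-many LSTM.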

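Image source:

At a high level, the code looks like this (similar to what is described):

The `data` array has shape (number of training examples, sequence length, input dimension) = (1200, 10, 5) and looks like this: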

array([[[1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        ..., 
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],
        ... ]
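
For anyone who wants an array with this layout to experiment on, here is a minimal sketch. It builds one-hot sequences in which the 1 moves one position per timestep and zero-pads them to 10 timesteps. The helper name `make_moving_one_data`, the random sequence lengths, and the seed are assumptions for illustration; this is not the asker's actual generator.

```python
import numpy as np

def make_moving_one_data(n_samples=1200, seq_len=10, input_dim=5, seed=0):
    """Return an array of shape (n_samples, seq_len, input_dim).

    Each sample is a one-hot sequence whose 1 moves one position to the
    right per timestep; timesteps past the sample's length stay all-zero
    (post-padding).
    """
    rng = np.random.default_rng(seed)
    data = np.zeros((n_samples, seq_len, input_dim), dtype="float32")
    for i in range(n_samples):
        length = rng.integers(1, input_dim + 1)  # 1..input_dim real timesteps
        for t in range(length):
            data[i, t, t] = 1.0
    return data

data = make_moving_one_data()
print(data.shape)  # (1200, 10, 5)
```

Because the padded timesteps are all-zero rows, they are easy to tell apart from real one-hot timesteps later on.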

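Question: I am not sure how to proceed, in particular how to integrate the `LSTM` into a `Model` and how to get the decoder to generate a sequence from a vector.

I am using `keras` with the `tensorflow` backend.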

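Edit: In case anyone wants to try it, here is my procedure for generating random sequences with a moving one (including padding):

import random
import math

def getNotSoRandomList(x):
    rlen = 8
    rlist = [0 for x in range(rlen)]
    if x …

Models can be any way you want. If I understood correctly, you just want to know how to create models with LSTMs?

Using LSTMs

First, you have to define what your encoded vector looks like. Suppose you want it to be an array of 20 elements, a one-dimensional vector. So, shape (None, 20). Its size is up to you, and there is no clear rule for finding the ideal one.
Your input must be three-dimensional, such as your (1200, 10, 5). In Keras summaries and error messages it will be shown as (None, 10, 5), because "None" stands for the batch size, which can vary each time you train or predict.
There are many ways to do this, but suppose you want only one LSTM layer: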
from keras.layers import Input, LSTM, Reshape
from keras.models import Model

inpE = Input((10, 5))  # the batch size is not defined here
# further optional LSTM parameters can be passed as keyword arguments
outE = LSTM(units=20, return_sequences=False)(inpE)

This is enough for a very simple encoder that produces an array of 20 elements (you can stack more layers if you want). Let's create the model:
encoder = Model(inpE, outE)

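Now, for the decoder, things get fuzzier. You no longer have an actual sequence, but a static, meaningful vector. You may still want to use LSTMs; they will treat the vector as a sequence.
But here, since the input has shape (None, 20), you must first reshape it into some three-dimensional array so that LSTM layers can be attached next.
How you reshape it is entirely up to you. 20 steps of 1 element? 1 step of 20 elements? 10 steps of 2 elements? Who knows?
inpD = Input((20,))
outD = Reshape((10, 2))(inpD)  # assuming 10 steps of 2 elements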
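
To make the reshaping options concrete, here is a small NumPy sketch of the three layouts mentioned above. It is only an illustration of the shapes involved: the array `batch` is made-up data, and the assumption is that the Keras `Reshape` layer rearranges each sample in the same row-major order as `numpy.reshape`.

```python
import numpy as np

# 3 encoded vectors of size 20 (made-up values 0..59)
batch = np.arange(3 * 20, dtype="float32").reshape(3, 20)

as_10x2 = batch.reshape(-1, 10, 2)   # 10 steps of 2 elements
as_20x1 = batch.reshape(-1, 20, 1)   # 20 steps of 1 element
as_1x20 = batch.reshape(-1, 1, 20)   # 1 step of 20 elements

for name, arr in [("10x2", as_10x2), ("20x1", as_20x1), ("1x20", as_1x20)]:
    print(name, arr.shape)
```

All three hold exactly the same 20 numbers per sample; only the number of timesteps the following LSTM will iterate over changes.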

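It is important to notice that if you no longer have 10 steps, you cannot simply enable "return_sequences" and get the output you want. You will have to work a little. Actually, it is not necessary to use "return_sequences", or even LSTMs, but you may do so.
Since my reshape has 10 timesteps (intentionally), it is fine to use "return_sequences", because the result will have 10 timesteps (like the initial input):
outD1 = LSTM(5, return_sequences=True)(outD)  # 5 units, so the output has shape (None, 10, 5)
You could work in many other ways, such as simply creating a 50-unit LSTM that does not return sequences and then reshaping the result:
alternativeOut = LSTM(50, return_sequences=False)(outD)  # optional parameters omitted
alternativeOut = Reshape((10, 5))(alternativeOut)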

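And our models are:
decoder = Model(inpD, outD1)
alternativeDecoder = Model(inpD, alternativeOut)

After that, you combine the models in your code and train the autoencoder.
All three models (encoder, decoder, and autoencoder) share the same weights, so you can make the encoder produce results simply by calling its predict method:
encoderPredictions = encoder.predict(data)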
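
The step of combining the models is not spelled out above, so here is one possible sketch. The layer sizes follow this answer; the `mse` loss, the `adam` optimizer, the single epoch, and the random placeholder `data` are assumptions made only to keep the example short and runnable.

```python
import numpy as np
from keras.layers import Input, LSTM, Reshape
from keras.models import Model

# Encoder: (None, 10, 5) -> (None, 20)
inpE = Input((10, 5))
outE = LSTM(20, return_sequences=False)(inpE)
encoder = Model(inpE, outE)

# Decoder: (None, 20) -> (None, 10, 2) -> (None, 10, 5)
inpD = Input((20,))
outD = Reshape((10, 2))(inpD)
outD1 = LSTM(5, return_sequences=True)(outD)
decoder = Model(inpD, outD1)

# Autoencoder: chains the two models, so it reuses their layers and weights
autoencoder = Model(inpE, decoder(encoder(inpE)))
autoencoder.compile(optimizer="adam", loss="mse")  # assumed loss and optimizer

# Placeholder data with the per-sample shape from the question (random values)
data = np.random.rand(32, 10, 5).astype("float32")
autoencoder.fit(data, data, epochs=1, batch_size=16, verbose=0)

encoded = encoder.predict(data, verbose=0)     # shape (32, 20)
decoded = decoder.predict(encoded, verbose=0)  # shape (32, 10, 5)
print(encoded.shape, decoded.shape)
```

Training only the combined model is enough: since `encoder` and `decoder` are built from the same layer objects, they see the updated weights without any copying.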


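About LSTMs that generate sequences, what I often see is something like predicting the next element.
You take just a few elements of the sequence and try to find the next element. Then you move one step forward, and so on. This may be helpful for generating sequences.

You can find a simple sequence-to-sequence autoencoder here:

Here is an example.
Let's create synthetic data consisting of a few sequences. The idea is to look at these sequences through the lens of an autoencoder. In other words, to reduce their dimension or summarize them into a fixed-length vector.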
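
As a rough illustration of the next-element idea, the sketch below turns one series into (window, next value) training pairs. The helper name `make_windows` and the toy series are assumptions, and the series is assumed to be longer than the window.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (inputs, targets) for next-element prediction.

    inputs[i] holds `window` consecutive elements and targets[i] is the
    element that immediately follows them.
    """
    series = np.asarray(series, dtype="float32")
    n = len(series) - window
    inputs = np.stack([series[i:i + window] for i in range(n)])
    targets = series[window:]
    return inputs, targets

X, y = make_windows([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], window=3)
print(X.shape, y.shape)  # (3, 3) (3,)
```

To feed `X` to an LSTM it would still need a trailing feature axis, for example `X[..., np.newaxis]`.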
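import numpy as np
import pandas as pd
from sklearn import preprocessing
from keras.preprocessing.sequence import pad_sequences

# define input sequences (variable length, hence dtype=object)
sequence = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                     [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
                     [0.2, 0.4, 0.6, 0.8],
                     [0.3, 0.6, 0.9, 1.2]], dtype=object)

# prepare to normalize: one column per sequence, NaN where a sequence is shorter
x = pd.DataFrame(sequence.tolist()).T.values
scaler = preprocessing.StandardScaler()
x_scaled = scaler.fit_transform(x)
# note: sequence_normalized is computed here but the raw values are padded below
sequence_normalized = [col[~np.isnan(col)] for col in x_scaled.T]

# pad with dtype='float32'; the default integer dtype would truncate the floating-point values
sequence = pad_sequences(sequence, padding='post', dtype='float32')

# reshape input into [samples, timesteps, features]
n_obs = len(sequence)
n_in = 9
sequence = sequence.reshape((n_obs, n_in, 1))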
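
To show what the padding step does and why the dtype matters, here is a plain NumPy sketch of post-padding. `pad_post` is a hypothetical helper written for this illustration; it is not the Keras implementation.

```python
import numpy as np

def pad_post(sequences, dtype="float32"):
    """Zero-pad variable-length 1-D sequences at the end to a common length."""
    max_len = max(len(s) for s in sequences)
    padded = np.zeros((len(sequences), max_len), dtype=dtype)
    for i, s in enumerate(sequences):
        padded[i, :len(s)] = s  # values are cast to `dtype` on assignment
    return padded

raw = [[0.1, 0.2, 0.3], [0.2, 0.4]]
as_float = pad_post(raw)                   # values preserved
as_int = pad_post(raw, dtype="int32")      # fractional parts are lost
batch = as_float.reshape(len(raw), -1, 1)  # [samples, timesteps, features]
print(as_float, as_int, batch.shape, sep="\n")
```

With an integer dtype every value below 1 collapses to 0, which is why the float dtype has to be requested explicitly.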

Let's design a simple LSTM autoencoder:
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from keras.models import Model

# define encoder: compresses each sequence into a vector of length 2
visible = Input(shape=(n_in, 1))
encoder = LSTM(2, activation='relu')(visible)

# define reconstruction decoder
decoder1 = RepeatVector(n_in)(encoder)
decoder1 = LSTM(100, activation='relu', return_sequences=True)(decoder1)
decoder1 = TimeDistributed(Dense(1))(decoder1)

# tie it together
myModel = Model(inputs=visible, outputs=decoder1)

# summarize layers
myModel.summary()
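
The `RepeatVector` step is what turns the single encoded vector back into something with a time axis. A NumPy sketch of the equivalent operation, using made-up encodings, may help to see the shapes:

```python
import numpy as np

n_in = 9
# 2 samples, each encoded as a vector of length 2 (made-up values)
codes = np.array([[0.5, -1.0],
                  [2.0, 3.0]], dtype="float32")

# Insert a time axis and copy the encoding along it: (2, 2) -> (2, 9, 2)
repeated = np.repeat(codes[:, np.newaxis, :], n_in, axis=1)
print(repeated.shape)  # (2, 9, 2)
```

Every timestep of the decoder input is therefore the same vector; the decoder LSTM has to produce the variation over time from its own state.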


from keras.utils import plot_model

myModel.compile(optimizer='adam', loss='mse')

# the input is also the target, since the model learns to reconstruct it
history = myModel.fit(sequence, sequence,
                      epochs=400,
                      verbose=0,
                      validation_split=0.1,
                      shuffle=True)

# requires pydot and graphviz to be installed
plot_model(myModel, show_shapes=True, to_file='reconstruct_lstm_autoencoder.png')

# demonstrate recreation
yhat = myModel.predict(sequence, verbose=0)
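
One way to judge the reconstruction is a per-sequence error that skips the padded timesteps. This is an optional evaluation sketch, not part of the code above: `masked_mse` is a hypothetical helper, the toy arrays are made up, and it assumes real values never equal the padding value and that each sample has at least one real timestep.

```python
import numpy as np

def masked_mse(original, reconstructed, pad_value=0.0):
    """Per-sample MSE that ignores timesteps equal to `pad_value` in `original`."""
    mask = original != pad_value                     # True on real timesteps
    sq_err = (original - reconstructed) ** 2 * mask  # zero out padded positions
    return sq_err.sum(axis=(1, 2)) / mask.sum(axis=(1, 2))

# Toy arrays shaped [samples, timesteps, features]
original = np.array([[[0.2], [0.4], [0.0]],
                     [[0.3], [0.6], [0.9]]], dtype="float32")
reconstructed = np.array([[[0.2], [0.5], [0.7]],
                          [[0.3], [0.6], [0.9]]], dtype="float32")
errors = masked_mse(original, reconstructed)
print(errors)
```

In the first toy sample the padded third timestep is reconstructed badly, but it does not count towards the error.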

import matplotlib.pyplot as plt

#plot our loss 
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model train vs validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()


Now let's extract the encoder from the trained autoencoder:
# reuse the trained encoder layer (layer 0 is the input, layer 1 is the encoder LSTM)
encoder_layer = myModel.layers[1]

encoded_input = Input(shape=(n_in, 1))
encoder_model = Model(encoded_input, encoder_layer(encoded_input))

# look at the encoded sequences, which have length 2 (the dimension of the encoder)
out = encoder_model.predict(sequence)

f = plt.figure()
myx = out[:, 0]
myy = out[:, 1]
s = plt.scatter(myx, myy)

# label each point with its sequence number
for i in range(len(out)):
    plt.annotate(str(i + 1), (myx[i], myy[i]))

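And this is the representation of the sequences.

Why are you generating random sequences?

@lelloman: Just for testing. I hope that testing with this is sufficient. I think it should work, since the task is reconstruction and not finding patterns.

I am just curious, I am really not an expert. But doesn't an autoencoder need patterns in order to work?

@lelloman: I guess you are right in the end. But I would also guess that, if the problem is not too large and the dimension of the output vector is high enough, random sequences can be encoded without losing much information. But I may be wrong. Hopefully we will find out. My real data is of course not random.

You can use a one-to-many approach from here:

Thanks :) This is very helpful. You defined the variables inp and out twice, which is confusing (and will cause an error if the code is copied and pasted). But I got it. Also, it looks like Reshape((10,2)) expects out as its argument. Anyway, I tested your idea on a not-so-random sequence (a moving one, such as 10->01->01). The generated sequences look like [7.61515856e-01, 0.00000000e+00, 0.00000000e+00, -7.51162320e-02, -8.43070745e-02, ...] and the loss is around 0.08 to 0.13. If you have any ideas on how to improve this further, I would love to know :)

Sorry, I made a small mistake. As I corrected, Reshape((10,2)) takes inpD.

If you are training for many epochs and the results are not reaching what you expect, maybe you need more layers. It is possible that this model is not …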