Python Keras TF: ValueError: Input arrays should have the same number of samples as target arrays


Using the Keras DL library with the Tensorflow backend, I am trying to implement a batch and validation generator for sentiment analysis on the built-in IMDB dataset.

The dataset contains 25000 training samples and 25000 test samples. Since setting a cutoff on the number of words per sample yields fairly low accuracy, I am trying to batch the training and test samples so the memory load does not get too high.

Current code:

from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, Dropout
from keras.layers import LSTM, TimeDistributed
from keras.datasets import imdb
from keras.callbacks import EarlyStopping, ModelCheckpoint
import numpy as np


max_features = 20000

def generate_batch(batchsize):
    '''

    '''
    (x_train, y_train), (_,_) = imdb.load_data()
    for i in range(0, len(x_train), batchsize):
        x_batch = x_train[i:(i+batchsize)]
        y_batch = y_train[i:(i+batchsize)]
        x_batch = sequence.pad_sequences(x_train, maxlen=None)
        yield(x_batch, y_batch)

def generate_val(valsize):
    '''
    '''
    (_,_), (x_test, y_test) = imdb.load_data()
    for i in range(0, len(x_test), valsize):
        x_val = x_test[i:(i+valsize)]
        y_val = y_test[i:(i+valsize)]
        x_val = sequence.pad_sequences(x_test, maxlen=None)
        yield(x_val, y_val)

print('Build model...')
primary_model = Sequential()
primary_model.add(Embedding(input_dim = max_features,
                    output_dim = max_features,
                    trainable=False, 
                    weights=[(np.eye(max_features,max_features))], 
                    mask_zero=True))
primary_model.add(TimeDistributed(Dense(150, use_bias=False)))
primary_model.add(LSTM(128))
primary_model.add(Dense(2, activation='softmax'))
primary_model.summary()
primary_model.compile(loss='sparse_categorical_crossentropy', 
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
filepath = "primeweights-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath,
                            verbose=1,
                            save_best_only=True)
early_stopping_monitor = EarlyStopping(patience=2)

primary_model.fit_generator(generate_batch(25),
                            steps_per_epoch = 1000,
                            epochs = 1, 
                            callbacks=[early_stopping_monitor],
                            validation_data=generate_val(25),
                            validation_steps=1000)


score, acc = primary_model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

primary_model.save('primary_model_imdb.h5')
However, when trying to run the current code, Keras throws the following error:

Traceback (most recent call last):
  File "imdb_gen.py", line 94, in <module>
    validation_steps = 1000)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/models.py", line 1276, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 2224, in fit_generator
    class_weight=class_weight)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1877, in train_on_batch
    class_weight=class_weight)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1490, in _standardize_user_data
    _check_array_lengths(x, y, sample_weights)
  File "/home/d/user/.local/lib/python3.5/site-packages/keras/engine/training.py", line 220, in _check_array_lengths
    'and ' + str(list(set_y)[0]) + ' target samples.')
ValueError: Input arrays should have the same number of samples as target arrays. Found 25000 input samples and 25 target samples.

There are several errors in the code:

  • As @Y.Luo pointed out in the comments:
  • When loading the imdb dataset you must pass
    num_words=max_features
    , otherwise your embedding layer expects word ids below max_features but ends up receiving ids larger than that
  • It is advisable to pass a fixed
    maxlen
    when padding, otherwise each batch is padded to its own
    maxlen
    , which can vary from batch to batch
  • You use an embedding layer without training it and without keeping its input and output dimensions equal. That made no sense to me, so I changed it
  • To evaluate the model on the test data, you must first load that data and then convert it into padded sequences
  • When training for multiple epochs with a generator, we must make sure the generator keeps yielding values. To do so, once the dataset is exhausted, we need to start yielding again from 0
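To see why a fixed maxlen matters, here is a toy sketch of the per-batch padding behaviour. The `pad` helper below is an illustrative stand-in, not the Keras `pad_sequences` API:

```python
# Illustrative stand-in for pad_sequences: pads (or truncates) each
# sequence to `maxlen`, or to the batch's own longest sequence if None.
def pad(seqs, maxlen=None):
    width = maxlen if maxlen is not None else max(len(s) for s in seqs)
    return [(s + [0] * width)[:width] for s in seqs]

batch_a = pad([[1, 2], [3]])            # padded to this batch's width: 2
batch_b = pad([[4, 5, 6, 7], [8]])      # padded to this batch's width: 4
batch_c = pad([[1, 2], [3]], maxlen=4)  # fixed width 4, same for every batch
```

With `maxlen=None`, `batch_a` and `batch_b` end up with different widths, so consecutive batches fed to the model have different shapes; passing an explicit `maxlen` makes every batch uniform.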

The full working code is linked (updated with the fix for point 6).

This is caused by a typo in your code:
x_batch = sequence.pad_sequences(x_train, maxlen=None)
gives you the entire padded
x_train
, which contains 25000 samples. You probably want
x_batch = sequence.pad_sequences(x_batch, maxlen=None)
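The mismatch can be reproduced with plain lists. The `pad` helper below is a toy stand-in for `pad_sequences`, just to show the sample counts:

```python
# Toy stand-in for pad_sequences: pads ragged lists with trailing zeros.
def pad(seqs):
    width = max(len(s) for s in seqs)
    return [s + [0] * (width - len(s)) for s in seqs]

x_train = [[1], [2, 3], [4], [5, 6, 7]]   # pretend full training set
y_train = [0, 1, 0, 1]

batchsize = 2
x_batch = x_train[0:batchsize]
y_batch = y_train[0:batchsize]

wrong = pad(x_train)   # pads the WHOLE set: 4 inputs vs 2 targets
right = pad(x_batch)   # pads just the slice: 2 inputs vs 2 targets
```

Yielding `(wrong, y_batch)` is exactly the 25000-vs-25 shape mismatch from the traceback, scaled down.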
Great, thanks. Regarding the embedding layer, I am training this model so it can later be used to optimize inputs against a given node's best set of weights. However, with this fixed code the program does not get past the first epoch and never stops iterating within it. Any idea what causes this?
x_batch = sequence.pad_sequences(x_train, maxlen=None) # gives 25000 samples

x_batch = sequence.pad_sequences(x_batch, maxlen=None) # gives batch_size
# Cap the vocabulary so word ids stay below max_features
(x_train, y_train), (_,_) = imdb.load_data(num_words=max_features)
# Pad every batch to the same fixed length
x_batch = sequence.pad_sequences(x_batch, maxlen=maxlen, padding='post')
# Trainable embedding with a separate embedding dimension
primary_model.add(Embedding(input_dim = max_features,
                    output_dim = embedding_dim,
                    trainable=True, 
                    weights=[(np.eye(max_features,embedding_dim))], 
                    mask_zero=True))
# Load and pad the test data before evaluating
(_,_), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen, padding='post')
score, acc = primary_model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
# Wrap the loop in while True so the generator keeps yielding across epochs
def generate_batch(batchsize):

    (x_train, y_train), (_,_) = imdb.load_data(num_words=max_features)
    print("train_size", x_train.shape)
    while True:
        for i in range(0, len(x_train), batchsize):
            x_batch = x_train[i:(i+batchsize)]
            y_batch = y_train[i:(i+batchsize)]
            x_batch = sequence.pad_sequences(x_batch, maxlen=maxlen, padding='post')
            yield(x_batch, y_batch)