Keras: Multi-class CNN text classifier predicts the same class for all input data


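I am trying to predict one of 8 classes for some short text data using word embeddings and a CNN. The problem is that the CNN classifier predicts everything as a single class (the largest one in the training data, which makes up about 40% of the distribution). The text data should be well preprocessed, since classification works well with other classifiers such as SVM or NB.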
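Since the largest class covers roughly 40% of the data, a model that always outputs that class would already reach about 0.40 accuracy, so that figure is a useful baseline to compare the CNN against. The following is a minimal sketch in plain Python, assuming integer class labels; `majority_baseline`, `balanced_class_weights`, and the toy `labels` list are hypothetical and not taken from the code in this question.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy data (assumption): class 0 holds 40% of the samples, as in the question.
labels = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3]
baseline_label, baseline_acc = majority_baseline(labels)
weights = balanced_class_weights(labels)
print(baseline_label, baseline_acc, weights)
```

If validation accuracy sits at about the baseline value, the network has probably collapsed onto the majority class. A dictionary like `weights` could be passed to the `class_weight` argument of `model.fit`, which may help if the imbalance is part of the cause.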

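'''
# First, build an index mapping words in the embeddings set to their embedding vectors
import os
embeddings_index = {}
f = open(os.path.join('', 'embedded_word2vec.txt'), encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:])
    embeddings_index[word] = coefs
f.close()
'''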
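With a frozen Embedding layer, one thing that may be worth verifying is that row i of the weight matrix really holds the vector of the word that the tokenizer maps to integer i, with row 0 left for padding. The sketch below illustrates that alignment in plain Python with made-up data; `build_embedding_rows`, the toy `word_index`, and the toy `vectors` dictionary are hypothetical stand-ins and not part of the original code.

```python
def build_embedding_rows(word_index, vectors, dim):
    """Build rows so that rows[i] is the vector of the word whose index is i.

    word_index: dict mapping word -> integer index (1-based, 0 is padding)
    vectors:    dict mapping word -> list of floats (the pretrained embeddings)
    Returns the rows and the list of words that had no pretrained vector.
    """
    rows = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 stays zero
    missing = []
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)  # out-of-vocabulary words keep a zero row
            continue
        rows[idx] = list(vec)
    return rows, missing

# Toy data (assumption): "bad" has no pretrained vector.
word_index = {"good": 1, "bad": 2, "movie": 3}
vectors = {"good": [0.1, 0.2], "movie": [0.5, 0.6]}
rows, missing = build_embedding_rows(word_index, vectors, dim=2)
print(len(rows), missing)
```

If the matrix is instead filled by enumerating a separate vocabulary list, the row order can differ from the tokenizer's indices, and the frozen layer would then look up unrelated vectors. A large `missing` list would also point to a mismatch between the preprocessing used for the embeddings and for the classifier input.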

I would appreciate any hints. Thanks, everyone.

# Vectorize the text samples into a 2D integer tensor
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(stemmed_text)
sequences = tokenizer_obj.texts_to_sequences(stemmed_text)

#pad sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
max_length = max([len(s.split(',')) for s in total_texts])
print('MaxLength :', max_length)
word_index = tokenizer_obj.word_index
print('Found %s unique Tokens.' % len(word_index))

text_pad = pad_sequences(sequences)
labels = to_categorical(np.asarray(labels))
print('Shape of label tensor:', labels.shape)
print('Shape of Text Tensor: ', text_pad.shape)

embedding_dim = 100
num_words = len(word_index) + 1

#prepare embedding matrix
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for index, word in enumerate(vocab):
    embedding_vector = w2v.wv[word]
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

#train test split 
validation_split = 0.2
indices = np.arange(text_pad.shape[0])
np.random.shuffle(indices)
text_pad = text_pad[indices]
labels = labels[indices]
num_validation_samples = int(validation_split * text_pad.shape[0])

x_train = text_pad[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_test = text_pad[-num_validation_samples:]
y_test = labels[-num_validation_samples:]

from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout, Conv1D, MaxPooling1D, Embedding

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=max_length, 
                    #embeddings_initializer = Constant(embedding_matrix),
                    weights=[embedding_matrix], 
                    trainable=False))

model.add(Dropout(0.2))
model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
#model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
#model.add(MaxPooling1D(pool_size=4))
model.add(Flatten())
model.add(Dense(8, activation='sigmoid'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train,
          epochs=25,
          validation_data=(x_test, y_test))
model.summary()  # summary() prints the table itself and returns None
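One detail in the model above that may be related: the output layer uses `sigmoid` while the loss is `categorical_crossentropy`. For single-label classification over 8 classes, the more common pairing is a `softmax` output, which yields one probability distribution per sample. The sketch below is plain Python with hypothetical logits (not taken from the model above) and only illustrates the difference between the two activations.

```python
import math

def sigmoid(logits):
    """Squash each logit independently into (0, 1); outputs need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    """Normalize logits into a probability distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes.
logits = [2.0, 1.0, 0.1]
sig = sigmoid(logits)
soft = softmax(logits)
print(sum(sig), sum(soft))
```

Changing the last layer to `Dense(8, activation='softmax')` would be a cheap experiment. Whether it resolves the collapse onto one class is not certain, since the class imbalance and the embedding setup could also contribute.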