How to implement a multi-label text classifier in Keras?

I have been trying to create a multi-label text classifier that uses GloVe embeddings with Keras. I am currently experimenting with the TREC-6 dataset, available at . I only consider the 5 broad labels of my classification problem and ignore the sub-labels.
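
Each raw TREC label is a coarse:fine pair such as HUM:ind, so keeping only the broad labels amounts to splitting on the colon. A minimal sketch of that step (the file name and the "label question" line layout are assumptions for illustration, not code from my project):

def load_trec_coarse(path="data/train_5500.label", encoding="latin-1"):
    """Sketch: read TREC question-classification data and keep only the
    coarse part of each "coarse:fine" label (e.g. "HUM:ind" -> "HUM")."""
    questions, labels = [], []
    with open(path, encoding=encoding) as f:
        for line in f:
            label, question = line.strip().split(" ", 1)
            labels.append(label.split(":")[0])   # keep only the coarse category
            questions.append(question)
    return questions, labels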

Since this is a multi-label classification problem, given a sentence my neural network should output every label with a probability greater than 0.1. The problem is that the network almost always classifies a question with a single label, which is fine when I ask a question that belongs to only one category. However, when I combine questions from different categories into one, it still gives only one label with very high confidence most of the time, even though I want all the relevant labels to be identified. I am absolutely sure the preprocessing steps are correct; I feel something is wrong with my model.
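
To be explicit about what I mean by multi-label: each question can carry several tags at once, so the targets are multi-hot vectors and every label gets its own sigmoid output. A minimal sketch with scikit-learn's MultiLabelBinarizer, the same encoder used in the code further down (the two example tag lists are made up):

from sklearn.preprocessing import MultiLabelBinarizer

# Sketch: multi-label targets are multi-hot vectors, one column per label.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform([["human", "location"], ["numeric"]])
print(mlb.classes_)  # ['human' 'location' 'numeric']
print(y)             # [[1 1 0]
                     #  [0 0 1]]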

At first I experimented only with CNNs, following the paper "Convolutional Neural Networks for Sentence Classification", but after seeing them fail on long questions that span several related topics, and on my teacher's advice, I tried experimenting with LSTMs and BiLSTMs. I started from that approach and have been tweaking parameters and adding/removing layers in the hope of getting a good result, but so far I have failed. I also tried copy-pasting some attention-mechanism code and adding it after my LSTM layer, but it did not help.
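
For reference, one simple way to put an attention layer after the BiLSTM using only stock layers from a recent Keras 2.x looks roughly like the sketch below; this only illustrates the general idea, it is not the exact code I copied, and the helper name simple_attention is just for illustration:

from keras.layers import Dense, Softmax, Lambda
import keras.backend as K

def simple_attention(sequence):
    """Sketch of additive attention over timesteps.

    sequence: a (batch, timesteps, features) tensor, e.g. the output of
    Bidirectional(LSTM(..., return_sequences=True)). Returns a
    (batch, features) context vector that could replace global max pooling.
    """
    scores = Dense(1, activation="tanh")(sequence)      # (batch, timesteps, 1)
    weights = Softmax(axis=1)(scores)                   # normalise over timesteps
    context = Lambda(lambda t: K.sum(t[0] * t[1], axis=1))([sequence, weights])
    return context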

My current model looks something like this. For clarity, I'll paste most of the rest of my code as well. The model and training code are in the sentence_classifier_cnn() function.

# Imports assumed by the code below; OutputObserver is a custom Keras callback
# defined elsewhere in my project.
import json
import os
import re
import sys

import numpy as np
import mysql.connector

import keras
from keras.layers import Dense, Dropout, Embedding, Input
from keras.models import Model, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import MultiLabelBinarizer


class SentenceClassifier:
    def __init__(self):
        self.MAX_SEQUENCE_LENGTH = 200
        self.EMBEDDING_DIM = 100
        self.LABEL_COUNT = 0
        self.WORD_INDEX = dict()
        self.LABEL_ENCODER = None

    def clean_str(self, string):
        """
        Cleans each string and convert to lower case.
        """

        string = re.sub(r"\'s", "", string)
        string = re.sub(r"\'ve", "", string)
        string = re.sub(r"n\'t", " not", string)
        string = re.sub(r"\'re", "", string)
        string = re.sub(r"\'d", "", string)
        string = re.sub(r"\'ll", "", string)
        string = re.sub(r"[^A-Za-z0-9]", " ", string)
        string = re.sub(r"\s{2,}", " ", string)

        return string.strip().lower()

    def loader_encoder(self, table, type="json"):
        """
        Load and encode data from dataset.

        type = "sql" means get data from MySQL database.
        type = "json" means get data from .json file.
        """

        if type == "json":
            with open('data/' + table + '.json', 'r', encoding='utf8') as f:
                datastore = json.load(f)
                questions = []
                tags = []
                for row in datastore:
                    questions.append(row['question'])
                    tags.append(row['tags'].split(','))

        tokenizer = Tokenizer(lower=True, char_level=False)
        tokenizer.fit_on_texts(questions)
        self.WORD_INDEX = tokenizer.word_index

        questions_encoded = tokenizer.texts_to_sequences(questions)
        questions_encoded_padded = pad_sequences(questions_encoded, maxlen=self.MAX_SEQUENCE_LENGTH, padding='post')


        # Drop empty tag strings; building new lists avoids mutating a list
        # while iterating over it, which can skip elements.
        tags = [[tag for tag in ele if len(tag) > 0 and tag != ','] for ele in tags]


        encoder = MultiLabelBinarizer()
        tags_encoded = encoder.fit_transform(tags)
        self.LABEL_ENCODER = encoder
        self.LABEL_COUNT = len(tags_encoded[0]) #No. of labels
        print("\tUnique Tokens in Training Data: ", len(self.WORD_INDEX))


        return questions_encoded_padded, tags_encoded

    def load_embeddings(self, EMBED_PATH='./embeddings/glove.6B.100d.txt'):
        """
        Load pre-trained embeddings into memory.
        """
        embeddings_index = {}
        try:
            f = open(EMBED_PATH, encoding='utf-8')
        except FileNotFoundError:
            print("Embeddings missing.")
            sys.exit()
        for line in f:
            values = line.rstrip().rsplit(' ')
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = vec
        f.close()
        print("\tNumber of tokens in embeddings file: ", len(embeddings_index))
        return embeddings_index

    def create_embedding_matrix(self, embeddings_index):
        """
        Creates an embedding matrix for all the words(vocab) in the training data with shape (vocab, EMBEDDING_DIM).
        Out-of-vocab words will be randomly initialized to values between +0.25 and -0.25.
        """
        words_not_found = []
        vocab = len(self.WORD_INDEX) + 1
        embedding_matrix = np.random.uniform(-0.25, 0.25, size=(vocab, self.EMBEDDING_DIM))
        for word, i in self.WORD_INDEX.items():
            if i >= vocab:
                continue
            embedding_vector = embeddings_index.get(word)
            if (embedding_vector is not None) and len(embedding_vector) > 0:
                embedding_matrix[i] = embedding_vector
            else:
                words_not_found.append(word)
        # print('Number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
        print("\tShape of embedding matrix: ", str(embedding_matrix.shape))
        print("\tNo. of words not found in pre-trained embeddings: ", len(words_not_found))
        return embedding_matrix

    def sentence_classifier_cnn(self, embedding_matrix, x, y, table, load_saved=0):
        """
        A BiLSTM + CNN model built with the Keras functional API.

        If load_saved=1, load a previously saved model instead of training a new one.
        """
        model_name = table + ".model.h5"
        if load_saved == 1 and os.path.exists('./saved/' + model_name):
            print("\nLoading saved model...")
            model = load_model('./saved/' + model_name)
            print("Model Summary")
            print(model.summary())
            return model


        print("\nTraining model...")
        inputs = Input(shape=(self.MAX_SEQUENCE_LENGTH,), dtype='int32')
        embedding = Embedding(input_dim=(len(self.WORD_INDEX) + 1), output_dim=self.EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=self.MAX_SEQUENCE_LENGTH)(inputs)

        X = keras.layers.SpatialDropout1D(0.3)(embedding)

        X = keras.layers.Bidirectional(keras.layers.CuDNNLSTM(64, return_sequences = True))(X)  

        #X2 = keras.layers.Bidirectional(keras.layers.CuDNNGRU(128, return_sequences = False))(X)  

        X = keras.layers.Conv1D(32, kernel_size=2, padding='valid', kernel_initializer='normal')(X)           
        X = keras.layers.GlobalMaxPooling1D()(X)       
        #X = Attention(self.MAX_SEQUENCE_LENGTH)(X)

        X = Dropout(0.5)(X)
        X = keras.layers.Dense(16, activation="relu")(X)
        X = Dropout(0.5)(X)
        X = keras.layers.BatchNormalization()(X)

        output = Dense(units=self.LABEL_COUNT, activation='sigmoid')(X)

        model = Model(inputs=inputs, outputs=output, name='intent_classifier')
        print("Model Summary")
        print(model.summary())

        # OutputObserver is a custom callback (defined elsewhere) that prints the
        # predictions for a test question after every epoch; classifier is a
        # SentenceClassifier instance created elsewhere.
        cbk = OutputObserver(model, classifier)
        model.compile(loss='binary_crossentropy',
                          optimizer='adam',
                          metrics=['accuracy'])
        model.fit(x, y,
                      batch_size=30,
                      epochs=23,
                      verbose=2,
                      callbacks = [cbk])


        #keras.utils.vis_utils.plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

        return model

    def tag_question(self, model, question):

        question = self.clean_str(question)
        question_encoded = [[self.WORD_INDEX[w] for w in question.split(' ') if w in self.WORD_INDEX]]
        question_encoded_padded = pad_sequences(question_encoded, maxlen=self.MAX_SEQUENCE_LENGTH, padding='post')
        predictions = model.predict(question_encoded_padded)

        possible_tags = []
        for i, probability in enumerate(predictions[0]):
            if probability >= 0.01:
                possible_tags.append([self.LABEL_ENCODER.classes_[i], probability])

        possible_tags.sort(reverse=True, key=lambda x:x[1]) #sort in place on the basis of the probability in each sub-list in descending order
        print(possible_tags)


    def setup_classifier(self, table):
        '''
        Load the data set and embeddings, build the embedding matrix, and train (or load) the model.
        '''
        print("Loading Data Set...")
        x, y = self.loader_encoder(table)

        embeddings_index = self.load_embeddings()

        print("\nGenerating embedding matrix...")
        embedding_matrix = self.create_embedding_matrix(embeddings_index)

        #Loading / Training model
        model = self.sentence_classifier_cnn(embedding_matrix, x, y, table, load_saved=1)

        return model, embeddings_index

    def connect_to_db(self):
        mydb = mysql.connector.connect(host="localhost", user="root", passwd="root", database="questiondb")
        cursor = mydb.cursor()
        return mydb, cursor
如您所见,我使用回调在每个步骤后打印预测。 我曾尝试预测各种问题的标签,但我只对属于一个类别的问题得到了很好的结果

For example,

classifier.tag_question(model, "how many days before new year?")
gives

as output. But a more complex question like

classifier.tag_question(model, "who is the prophet of the muslim people and  where is india located and how much do fruits costs there?")
gives something like

[['human', 0.9990531]]
even though tags like 'location' and 'numeric' are also relevant to the output. Using a callback that predicts the tags for this question after every epoch, I see something like this:

Epoch 1/23
 - 19s - loss: 0.6581 - acc: 0.6365
[['human', 0.69752634], ['location', 0.40014982], ['entity', 0.32047516], ['abbreviation', 0.23877779], ['numeric', 0.23324837], ['description', 0.15995058]]
Epoch 2/23
 - 12s - loss: 0.4525 - acc: 0.8264
[['human', 0.7437608], ['location', 0.18141672], ['entity', 0.14474556], ['numeric', 0.09171515], ['description', 0.053900182], ['abbreviation', 0.05283475]]
Epoch 3/23
 - 12s - loss: 0.3854 - acc: 0.8478
[['human', 0.86335427], ['location', 0.12673976], ['entity', 0.09847507], ['numeric', 0.064431995], ['description', 0.035599917], ['abbreviation', 0.02441895]]
Epoch 4/23
 - 12s - loss: 0.3634 - acc: 0.8509
[['human', 0.90795004], ['location', 0.10085008], ['entity', 0.09804481], ['numeric', 0.050411616], ['description', 0.032810867], ['abbreviation', 0.014970899]]
Epoch 5/23
 - 13s - loss: 0.3356 - acc: 0.8582
[['human', 0.8365586], ['entity', 0.1130701], ['location', 0.10253032], ['numeric', 0.039931685], ['description', 0.02874279]]
Epoch 6/23
 - 13s - loss: 0.3142 - acc: 0.8657
[['human', 0.95577633], ['entity', 0.088555306], ['location', 0.055004593], ['numeric', 0.015950901], ['description', 0.01428318]]
Epoch 7/23
 - 13s - loss: 0.2942 - acc: 0.8750
[['human', 0.89538944], ['entity', 0.130977], ['location', 0.06350105], ['description', 0.023014158], ['numeric', 0.019377537]]
Epoch 8/23
 - 13s - loss: 0.2739 - acc: 0.8802
[['human', 0.9725125], ['entity', 0.061141968], ['location', 0.026945814], ['description', 0.010931551]]
Epoch 9/23
 - 13s - loss: 0.2579 - acc: 0.8914
[['human', 0.9797143], ['entity', 0.042518377], ['location', 0.027904237]]
Epoch 10/23
 - 13s - loss: 0.2380 - acc: 0.9020
[['human', 0.7897601], ['entity', 0.14315197], ['location', 0.07439863], ['description', 0.019453615], ['numeric', 0.010681627]]
Epoch 11/23
 - 13s - loss: 0.2250 - acc: 0.9104
[['human', 0.9886158], ['entity', 0.024878502], ['location', 0.015951043]]
Epoch 12/23
 - 13s - loss: 0.2131 - acc: 0.9178
[['human', 0.9677731], ['entity', 0.03698206], ['location', 0.026153017]]
Epoch 13/23
 - 13s - loss: 0.2029 - acc: 0.9204
[['human', 0.9514474], ['entity', 0.053581357], ['location', 0.029657435]]
Epoch 14/23
 - 13s - loss: 0.1915 - acc: 0.9285
[['human', 0.9706739], ['entity', 0.0328649], ['location', 0.013876333]]
Epoch 15/23
 - 13s - loss: 0.1856 - acc: 0.9300
[['human', 0.9328136], ['location', 0.05573874], ['entity', 0.025918543]]
Epoch 16/23
 - 13s - loss: 0.1802 - acc: 0.9318
[['human', 0.9895527], ['entity', 0.014941782], ['location', 0.011972391]]
Epoch 17/23
 - 13s - loss: 0.1717 - acc: 0.9373
[['human', 0.9426272], ['entity', 0.03754583], ['location', 0.023379702]]
Epoch 18/23
 - 13s - loss: 0.1614 - acc: 0.9406
[['human', 0.99186605]]
Epoch 19/23
 - 13s - loss: 0.1573 - acc: 0.9432
[['human', 0.9926062]]
Epoch 20/23
 - 13s - loss: 0.1511 - acc: 0.9448
[['human', 0.9993554]]
Epoch 21/23
 - 13s - loss: 0.1591 - acc: 0.9426
[['human', 0.9964465]]
Epoch 22/23
 - 13s - loss: 0.1507 - acc: 0.9451
[['human', 0.999688]]
Epoch 23/23
 - 13s - loss: 0.1524 - acc: 0.9436
[['human', 0.9990531]]
I have tried changing my parameters hundreds of times, especially the network size, batch size and number of epochs, to avoid overfitting.
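
To make overfitting easier to see while tuning, a validation split with early stopping can be added to the fit call. A minimal sketch (the patience and split values are arbitrary examples; model, x, y and cbk are the objects from the code above):

from keras.callbacks import EarlyStopping

# Sketch: hold out part of the training data and stop when val_loss stops improving.
early_stop = EarlyStopping(monitor='val_loss', patience=3)
model.fit(x, y,
          batch_size=30,
          epochs=50,
          validation_split=0.1,
          callbacks=[cbk, early_stop],
          verbose=2)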

I know my question is very long, but I have run out of patience, and any help would be greatly appreciated.


Here is the link to my Colab notebook - .

I am also trying to solve a multi-label text classification problem from a paper. Have you solved your problem? I checked your code and you did not implement the attention mechanism. Did you drop it? I know this comment may not help solve your problem, but I am looking for some advice.