Keras: a well-trained classifier performs poorly on data from the same source

keras deep-learning nlp

I am working on a tweet classification project. The current task is to decide whether a tweet is related to road traffic.

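I collected a large number of tweets (more than 1 million). To train the classifier, I first used keywords (such as crash, accident, road) to pull out candidate traffic-related tweets. I then went through these candidates by hand and checked whether each one was actually traffic-related. I also built a second dataset of tweets that are not traffic-related. In the end I had 1,000 traffic-related tweets and 2,000 tweets unrelated to traffic.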
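For readers who want to picture the keyword step, here is a minimal sketch of how such a candidate filter might look. The keyword list contains only the three examples named above, and `select_candidates` and the sample tweets are hypothetical; the actual filtering code is not shown in the question.

```python
import re

# Hypothetical keyword list: the question only names these three as examples.
KEYWORDS = ["crash", "accident", "road"]

# Match whole words, case-insensitively, so "road" does not match "broad".
KEYWORD_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)

def select_candidates(tweets):
    """Return the tweets that mention at least one keyword."""
    return [t for t in tweets if KEYWORD_PATTERN.search(t)]

# Made-up sample tweets, for illustration only.
sample = [
    "Huge crash on the highway this morning",
    "My browser had a Crash again",
    "Lovely weather for a broad smile",
]
candidates = select_candidates(sample)
```

The second sample tweet passes the filter even though it is not about road traffic, which is why the candidates still need the manual check described above.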

I then classify the tweets with a CNN-LSTM model built on pretrained word embeddings. The overall structure of the model is:

from tensorflow import keras
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.metrics import binary_accuracy, binary_crossentropy
from tensorflow.keras.layers import Dense, Dropout, Conv2D, Reshape, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import LSTM

def conv2d_lstm_weibo(kernel_height, dropout_rate):
    # Get the input information - tweet only
    tweet_input = Input(shape=(60, 300, 1), name='tweet_input')

    # Create the convolutional layer and lstm layer
    conv2d = Conv2D(filters=100, kernel_size=(kernel_height, 300), padding='valid',
                    activation='relu', use_bias=True, name='conv_1')(tweet_input)
    reshaped_conv2d = Reshape((conv2d.shape[1], 100), name='reshape_1')(conv2d)
    conv_drop = Dropout(dropout_rate, name='dropout_1')(reshaped_conv2d)
    lstm = LSTM(100, return_state=False, activation='tanh',
                recurrent_activation='hard_sigmoid', name='lstm_1')(conv_drop)
    lstm_drop = Dropout(dropout_rate, name='dropout_2')(lstm)
    dense_1 = Dense(100, activation='relu', name='dense_1')(lstm_drop)
    output = Dense(2, activation='softmax', name='output_dense')(dense_1)
    # Build the model
    model = Model(inputs=tweet_input, outputs=output)
    return model
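To make the tensor shapes in this model explicit, the following sketch works out what `conv_1` and `reshape_1` produce for a given `kernel_height`. The helper `conv_lstm_shapes` and the value `kernel_height=3` are assumptions for illustration; the question does not say which kernel height was used.

```python
def conv_lstm_shapes(kernel_height, seq_len=60, embed_dim=300, filters=100):
    """Shapes (without the batch axis) after conv_1 and after reshape_1."""
    # With 'valid' padding and stride 1: output size = input size - kernel size + 1.
    conv_height = seq_len - kernel_height + 1
    # The kernel spans the full embedding width, so the width collapses to 1.
    conv_width = embed_dim - embed_dim + 1
    conv_shape = (conv_height, conv_width, filters)
    # reshape_1 drops the width axis, giving a (timesteps, features) sequence for the LSTM.
    lstm_input_shape = (conv_height, filters)
    return conv_shape, lstm_input_shape

# kernel_height=3 is a hypothetical value chosen only for this example.
conv_shape, lstm_input_shape = conv_lstm_shapes(kernel_height=3)
```

Under these assumptions the LSTM sees one 100-dimensional feature vector per window of `kernel_height` consecutive tokens.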

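The pretrained word vectors have dimension (300,1). The training, validation, and test sets make up 60%, 20%, and 20% of the labeled data, respectively. I used the Adam optimizer with a learning rate of 0.001. The model actually performs quite well, reaching an F1 score of 0.99 on the test data. The confusion matrix for the test data is: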

[[384   3]
 [  2 223]]
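As a sanity check, the reported F1 score can be recomputed from this matrix. The sketch below assumes rows are true classes and columns are predicted classes, with the second class being the traffic-related one; the question does not state the orientation, but the F1 for that class is the same if the matrix is transposed.

```python
# Confusion matrix from the question.
# Assumed layout: rows = true class, columns = predicted class,
# class 0 = not traffic-related, class 1 = traffic-related.
confusion = [[384, 3],
             [2, 223]]

tn, fp = confusion[0]
fn, tp = confusion[1]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The result rounds to 0.99, in line with the figure quoted above. Note that this number describes the labeled test split only, which was drawn from the keyword-filtered and hand-built sets rather than from the full collection.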

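However, when I use the trained classifier to predict over all of the tweets I collected, the performance is very poor. I manually checked the tweets that the classifier labeled as traffic_related and found that most of them have nothing to do with traffic at all.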

I would like to know which part of my approach is wrong. Any suggestions and insights are much appreciated.