Tensorflow deep learning regression - huge MSE and loss


I am trying to train a model to predict car prices. The dataset is from Kaggle:

I am preparing the data with the following code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# DataSet is a custom base class (defined elsewhere)
class CarDataset(DataSet):

    def __init__(self, csv_file):
        df = pd.read_csv(csv_file).drop(["dateCrawled", "name", "abtest", "dateCreated", "nrOfPictures", "postalCode", "lastSeen"], axis = 1)

        df = df.drop(df[df["seller"] == "gewerblich"].index).drop(["seller"], axis = 1)
        df = df.drop(df[df["offerType"] == "Gesuch"].index).drop(["offerType"], axis = 1)

        df = df[df["vehicleType"].notnull()]
        df = df[df["notRepairedDamage"].notnull()]
        df = df[df["model"].notnull()]
        df = df[df["fuelType"].notnull()]

        df = df[(df["price"] > 100) & (df["price"] < 100000)]
        df = df[(df["monthOfRegistration"] > 0) & (df["monthOfRegistration"] < 13)]
        df = df[(df["yearOfRegistration"] < 2019) & (df["yearOfRegistration"] > 1950)]
        df = df[(df["powerPS"] > 20) & (df["powerPS"] < 550)]

        df["hasDamage"] = np.where(df["notRepairedDamage"] == "ja", 1, 0)
        df["automatic"] = np.where(df["gearbox"] == "manuell", 1, 0)
        df["fuel"] = np.where(df["fuelType"] == "benzin", 0, 1)
        df["age"] = (2019 - df["yearOfRegistration"]) * 12 + df["monthOfRegistration"]

        df = df.drop(["notRepairedDamage", "gearbox", "fuelType", "yearOfRegistration", "monthOfRegistration"], axis = 1)

        df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"])

        self.df = df
        self.Y = self.df["price"].values
        self.X = self.df.drop(["price"], axis = 1).values

        scaler = StandardScaler()
        scaler.fit(self.X)

        self.X = scaler.transform(self.X)

        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(self.X, 
                                                                                    self.Y, 
                                                                                    test_size = 0.25,
                                                                                    random_state = 0)

        self.x_train, self.x_valid, self.y_train, self.y_valid = train_test_split(self.x_train, 
                                                                                    self.y_train, 
                                                                                    test_size = 0.25,
                                                                                    random_state = 0)   

    def get_input_shape(self):
        return (len(self.df.columns)-1, )        # (303, )
  • hasDamage is a flag (0 or 1) indicating whether the car has unrepaired damage
  • automatic is a flag (0 or 1) indicating whether the car has a manual or automatic gearbox
  • fuel is 0 for diesel and 1 for gasoline
  • age is the age of the car in months

The brand, model and vehicleType columns are one-hot encoded using df = pd.get_dummies(...).

Additionally, I use a StandardScaler to transform the X values.

The dataset now contains 303 X columns, and Y is of course the "price" column.

With this dataset, a regular LinearRegression reaches a score of ~0.7 on both the training and the test set.
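
For reference, the LinearRegression baseline mentioned above can be reproduced roughly like this (a sketch only; the CSV file name is an assumption, and CarDataset is the class shown above):

from sklearn.linear_model import LinearRegression

dataset = CarDataset("autos.csv")   # file name assumed, adjust to the Kaggle CSV

lr = LinearRegression()
lr.fit(dataset.x_train, dataset.y_train)

# LinearRegression.score() returns the r2 score
print("train r2:", lr.score(dataset.x_train, dataset.y_train))
print("test r2:", lr.score(dataset.x_test, dataset.y_test))    # ~0.7 as described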

Now I have tried a deep learning approach with Keras, but whatever I do, the MSE and loss are huge and the model does not seem to learn anything:

from keras.layers import Input, Dense, Activation
from keras.models import Model
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

input_tensor = model_stack = Input(dataset.get_input_shape()) # (303, )
model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_1")(model_stack)

model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_2")(model_stack)

model_stack = Dense(1, name = "Output")(model_stack)

model = Model(inputs = [input_tensor], outputs = [model_stack])
model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])

model.summary()

callbacks = []
callbacks.append(ReduceLROnPlateau(monitor = "val_loss", factor = 0.95, verbose = self.verbose, patience = 1))
callbacks.append(EarlyStopping(monitor='val_loss', patience = 5, min_delta = 0.01, restore_best_weights = True, verbose = self.verbose))


model.fit(x = dataset.x_train,
          y = dataset.y_train,
          verbose = 1,
          batch_size = 128,
          epochs = 200,
          validation_data = [dataset.x_valid, dataset.y_valid],
          callbacks = callbacks)

score = model.evaluate(dataset.x_test, dataset.y_test, verbose = 1)
print("Model score: {}".format(score))

The summary / training looks like this (with a learning rate of 3e-4):


I am still a beginner at machine learning. Is there any major/obvious mistake in my approach? What am I doing wrong?

A few suggestions:

  • Increase the number of neurons in the hidden layers

  • In your case, try not to use relu; use tanh instead

  • Remove the Dropout layers until the model starts working; you can add them back afterwards and retrain


  • Your model seems to be underfitting

    Try adding more neurons, as already suggested.
    Also try increasing the number of layers.
    Try using sigmoid as your activation function.
    Try increasing your learning rate. You can also switch between Adam and SGD (see the sketch below).
    
    Fitting a model from scratch is always trial and error. Try changing one parameter at a time, then two together, and so on. I would also suggest looking for papers or existing work on datasets similar to yours; that will give you some guidance.
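
    To experiment with the optimizer and learning-rate suggestions above, the compile call can be swapped out explicitly. This is only a sketch; the learning rates shown are arbitrary starting points, not values from the original post:

    from keras.optimizers import Adam, SGD

    # Variant 1: Adam with a somewhat larger learning rate than the 3e-4 used so far
    model.compile(loss = "mse", optimizer = Adam(lr = 1e-3), metrics = ['mse'])

    # Variant 2: plain SGD for comparison
    # model.compile(loss = "mse", optimizer = SGD(lr = 1e-2), metrics = ['mse'])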

    Solution: So after a while I found the Kaggle link for the correct dataset. I was using the first one, but the same data is also available there, with 222 kernels to look at. Looking at those now, that person was also getting "huge numbers" for the loss, which was the main source of my confusion (so far I had only dealt with "small numbers" or accuracies), and that made me rethink.

    Then it became clear to me:

    • The dataset is prepared correctly
    • The model works fine
    • I was using the wrong metrics / comparing them with other metrics from sklearn's LinearRegression, which are not comparable anyway
    In short:

    • An MAE (mean absolute error) of around 2000 means that, on average, the predicted car price is off by about 2000 € (e.g. for a correct price of 10000 €, the model predicts something between 8000 € and 12000 €)
    • The MSE (mean squared error) is of course a much larger number, which is to be expected, and not the "garbage" or broken-model output I first took it for
    • The "accuracy" metric is meant for classification and is useless for regression
    • sklearn's LinearRegression uses the r2 score as its default scoring function
    So I changed my metrics to "mae" and a custom r2 implementation in order to compare them with the LinearRegression.
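
    A custom r2 metric for Keras can be written against the backend API, roughly like this (a sketch; the function name r2_keras is mine, not from the original post):

    from keras import backend as K

    def r2_keras(y_true, y_pred):
        # r2 = 1 - SS_res / SS_tot, computed per batch
        ss_res = K.sum(K.square(y_true - y_pred))
        ss_tot = K.sum(K.square(y_true - K.mean(y_true)))
        return 1 - ss_res / (ss_tot + K.epsilon())

    model.compile(loss = "mse",
                  optimizer = optimizer(lr = learning_rate),
                  metrics = ['mae', r2_keras])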

    It turned out that, after about 100 epochs on the first try, I ended up with an MAE of 1900 and an r2 score of 0.69.

    Then, for comparison, I also computed the MAE of the LinearRegression, which evaluated to 2855.417 (with an r2 score of 0.67).


    So in fact the deep learning approach was already better on both MAE and r2 score. Nothing was wrong after all, and I can now continue tuning/optimizing the model :)

    Comments:

    • I suppose you are predicting a price, right? The loss is decreasing, so some learning is probably happening. (1) You should remove the Dropout: it is a regularization technique that prevents overfitting, which does not seem to be happening here; if overfitting occurs later, you can add it back then. (2) You should show an accuracy metric: metrics = ['mse', 'accuracy']. (3) Run more epochs. What do you see now?
    • It went from val_loss: 98661204.1644 - val_mean_squared_error: 98661204.1644 - val_acc: 6.5732e-05 to val_loss: 8097733.0068 - val_mean_squared_error: 8097733.0068 - val_acc: 6.5732e-04. I also have ReduceLROnPlateau, which kicks in after epoch 128. Also worth noting, I start with a learning rate of 3e-4. It doesn't seem to help much? To me it looks like I have a bigger problem with my setup/data/model, so that it is currently just producing garbage?
    • Yes, that doesn't look good. Could you post your full code, including data loading and preprocessing?
    • Edited the original post with the dataset preparation code, removed the Dropout, and added results for the model without Dropout & 200 epochs. Also added the link to the original dataset's Kaggle page.
    • Regarding "add the number of neurons in the hidden layers": do you mean I should increase the neuron count? Please see my edited original post, which contains new results for a model without Dropout. I also added the dataset info / preparation code. With your suggestions (tanh and 128/64 neurons) I get roughly the same results (loss: 101604805.5432 - mean_squared_error: 101604805.5432 - acc: 0.0000e+00 - val_loss: 101726501.6806 - val_mean_squared_error: 101726501.6806 - val_acc: 0.0000e+00), and I also tested 200 or more neurons without much change.
    • How did you apply the StandardScaler? I can't see it in your updated code @user826955
    • Added the code. Sorry, the data preparation is wrapped in two classes, the scaler …
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_1 (InputLayer)         (None, 6)                 0         
    _________________________________________________________________
    dense_1 (Dense)              (None, 20)                140       
    _________________________________________________________________
    relu_1 (Activation)          (None, 20)                0         
    _________________________________________________________________
    dense_2 (Dense)              (None, 20)                420       
    _________________________________________________________________
    relu_2 (Activation)          (None, 20)                0         
    _________________________________________________________________
    Output (Dense)               (None, 1)                 21        
    =================================================================
    Total params: 581
    Trainable params: 581
    Non-trainable params: 0
    _________________________________________________________________
    Train on 182557 samples, validate on 60853 samples
    Epoch 1/200
    182557/182557 [==============================] - 2s 13us/step - loss: 110046953.4602 - mean_squared_error: 110046953.4602 - acc: 0.0000e+00 - val_loss: 107416331.4062 - val_mean_squared_error: 107416331.4062 - val_acc: 0.0000e+00
    Epoch 2/200
    182557/182557 [==============================] - 2s 11us/step - loss: 97859920.3050 - mean_squared_error: 97859920.3050 - acc: 0.0000e+00 - val_loss: 85956634.8803 - val_mean_squared_error: 85956634.8803 - val_acc: 1.6433e-05
    Epoch 3/200
    182557/182557 [==============================] - 2s 12us/step - loss: 70531052.0493 - mean_squared_error: 70531052.0493 - acc: 2.1911e-05 - val_loss: 54933938.6787 - val_mean_squared_error: 54933938.6787 - val_acc: 3.2866e-05
    Epoch 4/200
    182557/182557 [==============================] - 2s 11us/step - loss: 42639802.3204 - mean_squared_error: 42639802.3204 - acc: 3.2866e-05 - val_loss: 32645940.6536 - val_mean_squared_error: 32645940.6536 - val_acc: 1.3146e-04
    Epoch 5/200
    182557/182557 [==============================] - 2s 11us/step - loss: 28282909.0699 - mean_squared_error: 28282909.0699 - acc: 1.4242e-04 - val_loss: 25315220.7446 - val_mean_squared_error: 25315220.7446 - val_acc: 9.8598e-05
    Epoch 6/200
    182557/182557 [==============================] - 2s 11us/step - loss: 24279169.5270 - mean_squared_error: 24279169.5270 - acc: 3.8344e-05 - val_loss: 23420569.2554 - val_mean_squared_error: 23420569.2554 - val_acc: 9.8598e-05
    Epoch 7/200
    182557/182557 [==============================] - 2s 11us/step - loss: 22874003.0459 - mean_squared_error: 22874003.0459 - acc: 9.8599e-05 - val_loss: 22380401.0622 - val_mean_squared_error: 22380401.0622 - val_acc: 1.6433e-05
    ...
    Epoch 197/200
    182557/182557 [==============================] - 2s 12us/step - loss: 13828827.1595 - mean_squared_error: 13828827.1595 - acc: 3.3414e-04 - val_loss: 14123447.1746 - val_mean_squared_error: 14123447.1746 - val_acc: 3.1223e-04
    
    Epoch 00197: ReduceLROnPlateau reducing learning rate to 0.00020950120233464986.
    Epoch 198/200
    182557/182557 [==============================] - 2s 13us/step - loss: 13827193.5994 - mean_squared_error: 13827193.5994 - acc: 2.4102e-04 - val_loss: 14116898.8054 - val_mean_squared_error: 14116898.8054 - val_acc: 1.6433e-04
    
    Epoch 00198: ReduceLROnPlateau reducing learning rate to 0.00019902614221791736.
    Epoch 199/200
    182557/182557 [==============================] - 2s 12us/step - loss: 13823582.4300 - mean_squared_error: 13823582.4300 - acc: 3.3962e-04 - val_loss: 14108715.5067 - val_mean_squared_error: 14108715.5067 - val_acc: 4.1083e-04
    Epoch 200/200
    182557/182557 [==============================] - 2s 11us/step - loss: 13820568.7721 - mean_squared_error: 13820568.7721 - acc: 3.1223e-04 - val_loss: 14106001.7681 - val_mean_squared_error: 14106001.7681 - val_acc: 2.3006e-04
    60853/60853 [==============================] - 1s 18us/step
    Model score: [14106001.790199332, 14106001.790199332, 0.00023006260989597883]
    
    input_tensor = model_stack = Input(dataset.get_input_shape()) # (303, )
    model_stack = Dense(128)(model_stack)
    model_stack = Activation("tanh", name = "tanh_1")(model_stack)
    
    model_stack = Dense(64)(model_stack)
    model_stack = Activation("tanh", name = "tanh_2")(model_stack)
    
    model_stack = Dense(1, name = "Output")(model_stack)
    
    model = Model(inputs = [input_tensor], outputs = [model_stack])
    model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])
    
    model.summary()
    