Python 3.x: Why does the train/test split affect the regression results so drastically?

Tags: python-3.x, scikit-learn, linear-regression

I am experimenting with Kaggle's house price prediction competition.

For convenience, the complete process of downloading, preprocessing, and running a first prediction with a simple linear regression model is given below. The steps are:

  • Download the data
  • Get the train and test data
  • Split the categorical and numerical data
  • Define a cleaning function to handle NaN/None
  • Perform the cleaning
  • Preprocess the data, i.e. one-hot encode the categorical features
  • Split the training data into an "internal" train and test set
  • Train... and this is where it gets interesting

Problem: in the process above, computing the Kaggle metric (i.e. RMSLE) raises an error because some predicted values are negative (see the minimal reproduction below). Interestingly, if I change the test_size parameter from 0.5 to 0.2, there are no more negative values. Understandably, the more data used for training, the better the model performs. However, if I move it from 0.2 to 0.3 (a much less drastic change, i.e. about 100 training samples), the model starts predicting negative values again.
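For context, the metric failure itself is easy to reproduce: scikit-learn's mean_squared_log_error rejects any negative value, since the logarithm is not defined there. A minimal, self-contained sketch (the numbers are made up for illustration):

    import numpy as np
    from sklearn.metrics import mean_squared_log_error

    y_true = np.array([100000.0, 200000.0])
    y_pred = np.array([150000.0, -5000.0])  # one negative prediction, as in the question
    np.sqrt(mean_squared_log_error(y_true, y_pred))  # raises ValueError because of the negative value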

Two questions:

  • Is this expected, i.e. is the model really that sensitive to the training data? It is all the more unclear because if test_size=0.2 is used with shuffle=False, it works; if shuffle=True is used, the model starts predicting negative values.

  • How should this behavior be handled? Admittedly this is a very simple model (no standardization, no scaling, no regularization...), but I believe it is interesting to really understand what is going on in such a simple model.

  • Is this expected, i.e. is the model really that sensitive to the training data? It is all the more unclear because if test_size=0.2 is used with shuffle=False, it works; if shuffle=True is used, the model starts predicting negative values.

    To your first question: yes, the split does matter this much.

    How to handle this behavior? Admittedly this is a very simple model (no standardization, no scaling, no regularization...), but I believe it is interesting to really understand what is going on in this very simple model.

    Have you heard of cross-validation?

    The concept is to train the classifier/regressor on several slices of the data, each time with a different train/test split, precisely to avoid the behavior you are describing; only then can you really judge the prediction quality, since new data may well come with yet another structure.
    So you should run several iterations and judge the results as a whole, e.g. as in the sketch below.
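    A minimal sketch of such a loop using scikit-learn's built-in cross-validation, assuming the one-hot encoded matrix xTrainOneHot and the labels yTrain produced by the question's code further down:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    # five different train/test splits instead of a single, arbitrary one
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(LinearRegression(), xTrainOneHot, yTrain, cv=cv, scoring="r2")
    print(scores)                         # one R^2 per fold
    print(scores.mean(), scores.std())    # a large spread confirms the split sensitivity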

    Hi. Yes, I know about cross-validation. As far as I understand, it is typically used when there is not enough data, and one can of course argue about what "enough data" means. To my mind the second question still stands: when you run into this kind of behavior, I would like to know what people actually do. I feel cross-validation does not solve the problem, it only exposes it, which is what I did myself. The next step is to solve it. What about regularization? Changing the model? Standardization? I know there are many things I could try, but I would like to know whether there is a "recipe" to follow when facing such a problem.

    Maybe the best approach is to tune your current model and keep the variant whose results are the most stable. As I said before, it can happen that new data also carries strange additional values, so you should build a model that may be slightly less accurate but is more stable; regularization, sketched below, is one way to do that.
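    A hedged sketch of that regularization route (scaling plus an L2 penalty via Ridge; the variable names assume the internal split created in the question's code below):

    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # the L2 penalty shrinks the huge coefficients that a plain linear regression
    # can produce on nearly collinear one-hot columns, stabilizing the predictions
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    model.fit(xTrainInternalOneHot, yTrainInternal)
    print(model.score(xTestInternalOneHot, yTestInternal))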
    import os
    import pandas as pd
    from kaggle.api.kaggle_api_extended import KaggleApi
    
    api = KaggleApi()
    api.authenticate()
    saveDir = "data"
    if not os.path.exists("data"):
        os.makedirs(saveDir)
    api.competition_download_files("house-prices-advanced-regression-techniques","data")
    
    # the competition files arrive as a single zip archive; extract it so the CSVs
    # can be read (assumption: the archive is named after the competition)
    import zipfile
    with zipfile.ZipFile(os.path.join(saveDir, "house-prices-advanced-regression-techniques.zip")) as zf:
        zf.extractall(saveDir)
    
    print("the following files have been downloaded\n" + '\n'.join('{}'.format(item) for item in os.listdir("data")))
    print("they are located in " + saveDir)
    
    train = pd.read_csv(os.path.join(saveDir, "train.csv"))
    test = pd.read_csv(os.path.join(saveDir, "test.csv"))
    
    xTrain = train.iloc[:,1:-1] # remove id & SalePrice
    yTrain = train.iloc[:,-1] # SalePrice
    xTest = test.iloc[:,1:] # remove id
    
    catData = xTrain.columns[xTrain.dtypes == object]
    numData = list(set(xTrain.columns) - set(catData))
    print("The number of columns in the original dataframe is " + str(len(xTrain.columns)))
    print("The number of columns in the categorical and numerical data dds up to " + str(len(catData)+len(numData)))
    
    def cleanData(data, catData, numData):
        dataClean = data.copy()
    
        # Let's deal with NaN ...
    
        # check where there are NaN in categorical
        dataClean[catData].columns[dataClean[catData].isna().any(axis=0)]
    
        # take care that some categorical could be numerics so
        # differentiate the two cases
        # (be careful not to pick a NaN or None value when evaluating the type)
        dataTypes = [dataClean.loc[dataClean.loc[:,col].notnull(),col].apply(type).iloc[0] for col in catData] # the data type of each column
    
        from itertools import compress
        catDataNum = [True if ((col == float) | (col == int)) else False for col in dataTypes] # if the data type is numeric (float/int), register it
        catDataNum = list(compress(catData, catDataNum))
        catDataNotNum = list(set(catData)-set(catDataNum))
    
        print("The number of columns in the dataframe is " + str(len(dataClean.columns)))
        print("The number of columns in the categorical and numerical data adds up to " + 
              str(len(catDataNum) + len(catDataNotNum) + len(numData)))
    
        # Check what NA means for each feature ...
        # BsmtQual : NA means no basement
        # GarageType : NA means no garage
        # BsmtExposure : NA means no basement
        # Alley : NA means no alley access
        # BsmtFinType2 : NA means no basement
        # GarageFinish : NA means no garage
        # did not check the rest ... I will just replace with a category "No"
    
        # For categorical, NaN values mean the considered feature
        # does not exist (this requires dataset analysis as performed above)
        dataClean[catDataNotNum] = dataClean[catDataNotNum].fillna(value='No')
        mean = dataClean[catDataNum].mean()
        dataClean[catDataNum] = dataClean[catDataNum].fillna(value=mean)
    
        # for numerical, replace with the mean
        mean = dataClean[numData].mean()
        dataClean[numData] = dataClean[numData].fillna(value=mean)
    
        return dataClean
    
    xTrainClean = cleanData(xTrain, catData, numData)
    
    # check if no NaN or None anymore
    if xTrainClean.isna().sum().sum() != 0:
        print(xTrainClean.iloc[:,xTrainClean.isna().any(axis=0).values])
    else :
        print("All good! No more NaN or None in training data!")
    
    # same with test data
    # perform the cleaning
    xTestClean = cleanData(xTest, catData, numData)
    
    # check if no NaN or None anymore
    if xTestClean.isna().sum().sum() != 0:
        print(xTestClean.iloc[:,xTestClean.isna().any(axis=0).values])
    else :
        print("All good! No more NaN or None in test data!")
    
    import sklearn as sk
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    
    # We would like to perform a linear regression on all data
    # but some data are categorical ... 
    # so first, perform a one-hot encoding on categorical variables
    ct = ColumnTransformer(transformers = [("OneHotEncoder", OneHotEncoder(categories='auto', drop=None, 
                                          sparse=False,
                                          handle_unknown = "error"),
                                          catData)],
                          remainder = "passthrough")
    ct = ct.fit(pd.concat([xTrainClean, xTestClean])) # fit on both xTrain & xTest to be sure to cover all possible categorical values
                                            # (try fitting them separately with .fit(xTrainClean) / .fit(xTestClean) and compare to see why;
                                            # the resulting categories and values can be inspected through
                                            # ct.named_transformers_['OneHotEncoder'].categories_)
    xTrainOneHot = ct.transform(xTrainClean)
    xTestOneHot = ct.transform(xTestClean) # encode the test data with the same fitted transformer (used below)
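    # A hedged alternative sketch to fitting on the concatenation: fit the encoder on
    # the training data only and let it zero-encode categories it has never seen,
    # instead of raising an error (trade-off: unseen test categories are silently dropped)
    ct_alt = ColumnTransformer(transformers = [("OneHotEncoder",
                                               OneHotEncoder(categories='auto', sparse=False,
                                                             handle_unknown = "ignore"),
                                               catData)],
                              remainder = "passthrough")
    ct_alt = ct_alt.fit(xTrainClean)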
    
    xTestOneHotKaggle = xTestOneHot.copy()
    
    from sklearn.model_selection import train_test_split
    xTrainInternalOneHot, xTestInternalOneHot, yTrainInternal, yTestInternal = train_test_split(xTrainOneHot, yTrain, test_size=0.5, random_state=42, shuffle = False)
    
    print("The training data now contains " + str(xTrainInternalOneHot.shape[0]) + " samples")
    print("The training data now contains " + str(yTrainInternal.shape[0]) + " labels")
    print("The test data now contains " + str(xTestInternalOneHot.shape[0]) + " samples")
    print("The test data now contains " + str(yTestInternal.shape[0]) + " labels")
    
    reg = LinearRegression().fit(xTrainInternalOneHot,yTrainInternal)
    yTrainInternalPredict = reg.predict(xTrainInternalOneHot)
    yTestInternalPredict = reg.predict(xTestInternalOneHot)
    print("The R2 score on training data is equal to " + str(reg.score(xTrainInternalOneHot,yTrainInternal)))
    print("The R2 score on the internal test data is equal to " + str(reg.score(xTestInternalOneHot, yTestInternal)))
    
    from sklearn.metrics import mean_squared_log_error
    print("Tke Kaggle metric score (RMSLE) on internal training data is equal to " + 
          str(np.sqrt(mean_squared_log_error(yTrainInternal, yTrainInternalPredict))))
    print("Tke Kaggle metric score (RMSLE) on internal test data is equal to " + 
          str(np.sqrt(mean_squared_log_error(yTestInternal, yTestInternalPredict))))
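    # One common mitigation (an assumption on my side, not part of the original post):
    # regress on log1p(SalePrice) instead of the raw price, so the back-transformed
    # predictions stay positive and the RMSLE can always be computed
    regLog = LinearRegression().fit(xTrainInternalOneHot, np.log1p(yTrainInternal))
    yTestInternalPredictLog = np.expm1(regLog.predict(xTestInternalOneHot))
    print("The Kaggle metric score (RMSLE) with a log-transformed target is equal to " +
          str(np.sqrt(mean_squared_log_error(yTestInternal, np.clip(yTestInternalPredictLog, 0, None)))))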