Python: Random forest gets 98% accuracy on training and test data, but always predicts the same class on new data


python machine-learning scikit-learn


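I have spent 30 hours on this single problem and it makes no sense to me. I hope one of you can show me a different perspective.

The problem is that I train a random forest on my training dataframe and get very good accuracy, 98%-99%, but when I load a new sample to predict on, the model always guesses the same class.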

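import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#  TEST_SIZE, N_ESTIMATORS, MAX_DEPTH and RANDOM_STATE are constants defined elsewhere in the script
#  Shuffle the data-frames records. The labels are still attached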
df = df.sample(frac=1).reset_index(drop=True)

#  Extract the labels and then remove them from the data
y = list(df['label'])
X = df.drop(['label'], axis='columns')

#  Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)

#  Construct the model
model = RandomForestClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, random_state=RANDOM_STATE,oob_score=True)

#  Calculate the training accuracy
in_sample_accuracy = model.fit(X_train, y_train).score(X_train, y_train)
#  Calculate the testing accuracy
test_accuracy = model.score(X_test, y_test)

print()
print('In Sample Accuracy: {:.2f}%'.format(in_sample_accuracy * 100))
print('Out-of-Bag Accuracy: {:.2f}%'.format(model.oob_score_ * 100))
print('Test Accuracy: {:.2f}%'.format(test_accuracy * 100))

    #  The json file is not in the correct format, this function normalizes it
    normalized_json = json_normalizer(json_file, "", training=False)
    #  Turn the json into a list of dictionaries which contain the features
    features_dict = create_dict(normalized_json, label=None)

    #  Convert the dictionaries into pandas dataframes
    df = pd.DataFrame.from_records(features_dict)
    print('Total amount of email samples: ', len(df))
    print()

    df = df.fillna(-1)
    #  One hot encodes string values
    df = one_hot_encode(df, noOverride=True)
    if 'label' in df.columns:
        df = df.drop(['label'], axis='columns')
    print(list(model.predict(df))[:100])
    print(list(model.predict(X_train))[:100])

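I process the data the same way, but when I predict on X_test or X_train I get the normal 98%, and when I predict on my new data it always guesses the same class.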

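Above is my test scenario. In the last two lines you can see that I predict on X_train, the data used to train the model, and on df, the out-of-sample data, for which it always guesses class 0.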

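Some useful information:

The dataset is imbalanced; class 0 has about 150,000 samples, while class 1 has about 600,000 samples
There are 141 features
Changing n_estimators and max_depth does not fix it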

Any ideas would be helpful. If you need more information, let me know; my brain is fried at this point and this is all I can think of.
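One thing that may be worth checking here: the new data is one-hot encoded separately from the training data, so its columns could come out in a different order or as a different set than the ones the model was fitted on. Whether that is what happens in this case is an assumption; the thread does not confirm it. The sketch below uses a hypothetical helper, `align_to_training_columns`, and toy frames (not the real email features) to show how a new frame could be compared against the training columns and aligned to them. Filling missing indicator columns with 0 is also an assumption.

```python
import pandas as pd


def align_to_training_columns(new_df, training_columns, fill_value=0):
    """Report column mismatches, then return new_df with exactly the training columns."""
    training_columns = list(training_columns)
    missing = [c for c in training_columns if c not in new_df.columns]
    extra = [c for c in new_df.columns if c not in training_columns]
    print('Columns missing from the new data:', missing)
    print('Columns unseen during training:', extra)
    # reindex drops the extra columns, adds the missing ones filled with
    # fill_value, and puts everything in the training column order.
    return new_df.reindex(columns=training_columns, fill_value=fill_value)


# Toy frames standing in for the one-hot encoded training data and new data.
train = pd.DataFrame({'size': [1, 2], 'tld_com': [1, 0], 'tld_org': [0, 1]})
new = pd.DataFrame({'tld_net': [1], 'size': [3]})

aligned = align_to_training_columns(new, train.columns)
print(aligned)
```

With the code from the question, the equivalent call would be something like `df = align_to_training_columns(df, X_train.columns)` just before `model.predict(df)`.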

Fixed. The problem was the imbalance of the dataset. I also realized that changing the depth gives me different results.

For example, 10 trees with depth 3 -> seems to work fine

10 trees with depth 6 -> back to only guessing the same class
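The answer does not say how the imbalance was handled. As a minimal sketch, assuming class weighting is an acceptable approach, the comparison below fits 10 trees at depths 3 and 6, with and without `class_weight='balanced'`, and prints the recall of each class instead of overall accuracy. The data is synthetic (`make_classification` with roughly a 1:4 class ratio), not the asker's data, so the numbers it prints say nothing about the original problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: about 20% class 0 and 80% class 1.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           weights=[0.2, 0.8], random_state=0)
# stratify=y keeps the class ratio the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

results = {}
for max_depth in (3, 6):
    for class_weight in (None, 'balanced'):
        model = RandomForestClassifier(n_estimators=10, max_depth=max_depth,
                                       class_weight=class_weight, random_state=0)
        model.fit(X_train, y_train)
        # average=None returns one recall value per class, in label order (0, 1).
        recalls = recall_score(y_test, model.predict(X_test), average=None)
        results[(max_depth, class_weight)] = recalls
        print(max_depth, class_weight, recalls)
```

A recall near 0 for one class is the signature of a model that "only guesses the same class", even when overall accuracy looks high.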

A few questions that are not answered in the post: 1. Did you apply any measures to handle the imbalanced data before training the model? 2. Was the data randomly sampled before training the model? 3. Was cross-validation applied before building the model?

@mnm The model was working a few days ago and predicted accurately even with the imbalanced data, so I did not try that. The data is randomly sampled, and I even tried re-processing and predicting on samples that were used in training, and it guessed the same class every time.

Could you check whether df is filled correctly? Maybe it is all -1 after df = df.fillna(-1)? Just a guess.

Accuracy is not a good metric for imbalanced data, because a model can score well on it by predicting only the majority class correctly. You need to (1) resample the data so that the classes are represented more or less evenly, (2) weight the classes, and (3) choose a more robust metric such as AUC or F1.

@DL_Engineer According to the OP, class 1 is 4 times larger than class 0, so it is quite likely that the model initially fits only class 1. Try building a new model with ROC AUC as the evaluation metric.
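To illustrate the point about metrics: with about 150,000 samples in class 0 and 600,000 in class 1, a model that always predicts class 1 would already score 600000 / 750000 = 80% accuracy. The sketch below, on synthetic data with a similar ratio (an assumption, not the real dataset), compares such a majority-class baseline with a class-weighted forest using accuracy, F1 for the minority class, and ROC AUC. The `evaluate` helper is hypothetical and exists only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: about 20% class 0 and 80% class 1.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           weights=[0.2, 0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def evaluate(clf):
    """Return accuracy, F1 for the minority class (0), and ROC AUC on the test set."""
    pred = clf.predict(X_test)
    # Column 1 of predict_proba is the estimated probability of class 1.
    proba_class_1 = clf.predict_proba(X_test)[:, 1]
    print(confusion_matrix(y_test, pred))
    return {
        'accuracy': accuracy_score(y_test, pred),
        'f1_class_0': f1_score(y_test, pred, pos_label=0, zero_division=0),
        'roc_auc': roc_auc_score(y_test, proba_class_1),
    }


# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
# Forest with class weights inversely proportional to the class frequencies.
forest = RandomForestClassifier(n_estimators=50, class_weight='balanced',
                                random_state=0).fit(X_train, y_train)

baseline_scores = evaluate(baseline)
forest_scores = evaluate(forest)
print('baseline:', baseline_scores)
print('forest:  ', forest_scores)
```

The baseline's accuracy sits close to the majority-class share, while its F1 for class 0 is 0 and its ROC AUC is 0.5, which is why the last two metrics expose a one-class predictor that accuracy alone would hide.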