Python: how to use downsampling and configure the class weight parameters when using XGBoost for imbalanced classification?

python machine-learning

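I am working on a binary classification problem on a dataset with extreme class imbalance. To help the model learn the signal of the minority class, I downsampled the majority class so that the training set is 20% minority class and 80% majority class.

There is also another parameter, scale_pos_weight. I am not sure how to set it after downsampling.

Should I set it based on the actual class ratio, or should I use the class ratio after downsampling?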

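Because you have already downsampled the data, you should set the scale_pos_weight parameter based on the downsampled data. Compute the value with the following formula:

scale_pos_weight = count(negative examples) / count(positive examples)

In your case:

scale_pos_weight = 80 / 20 = 4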
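As a minimal sketch of that calculation, assuming labels are encoded as 0 (negative) and 1 (positive): the helper name scale_pos_weight_from_labels and the 80/20 toy labels are illustrative only and are not part of XGBoost.

```python
from collections import Counter

def scale_pos_weight_from_labels(y):
    """Return count(negative) / count(positive) for 0/1 labels."""
    counts = Counter(y)
    if counts[1] == 0:
        raise ValueError("no positive examples in y")
    return counts[0] / counts[1]

# Toy labels mirroring the downsampled training set: 80% negative, 20% positive
y_train_downsampled = [0] * 80 + [1] * 20
ratio = scale_pos_weight_from_labels(y_train_downsampled)
print(ratio)  # 4.0
```

The value is computed from the labels the model is actually trained on, which is why the downsampled set is used here rather than the original data.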

You can also search automatically for the best set of parameters.

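Class weights are used when computing the loss function, to keep the model from favoring the majority class. If one class dominates the dataset, the model will tend to learn that class better, because the loss is determined mostly by the model's performance on that dominant class.

Consider the extreme case where the dataset contains 99% positive samples. A model that simply predicts 1 for every sample will have 99% accuracy. The idea behind class weights is that you want each class to contribute equally to the loss. You should therefore compute this ratio from your training set, because the loss is computed on your training set. Your model knows nothing about the samples you dropped.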
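The 99% example can be reproduced in a few lines of plain Python. This is only an illustration with made-up labels (1000 samples, label 1 as the dominant class), not output from any real model.

```python
# 1000 samples: 99% positive (label 1) and 1% negative (label 0)
y_true = [1] * 990 + [0] * 10
# A "model" that always predicts the dominant class
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the rare class: fraction of true 0s that were predicted as 0
rare_total = sum(t == 0 for t in y_true)
rare_hits = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
rare_recall = rare_hits / rare_total

print(accuracy)     # 0.99
print(rare_recall)  # 0.0
```

Accuracy looks excellent while the rare class is never detected, which is the failure that class weighting is meant to counter.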

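Say the loss is 0 if the prediction is correct and 1 otherwise. In your case, to make sure each class contributes equally to the loss, a wrong prediction on the minority class should be penalized 4 times as heavily as a wrong prediction on the majority class. That way the model can neither ignore one class nor favor the majority class.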
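The effect of the 4x penalty can be checked with a toy 0/1 loss on an 80/20 split. The function weighted_error and the label encoding (1 for the minority class) are assumptions made for this sketch; XGBoost itself optimizes a differentiable loss, not this one.

```python
def weighted_error(y_true, y_pred, minority_weight):
    """Sum of 0/1 losses, with errors on the minority class (label 1) scaled up."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        loss = 0.0 if t == p else 1.0
        total += loss * (minority_weight if t == 1 else 1.0)
    return total

y_true = [0] * 80 + [1] * 20

# Worst case for each class separately
all_majority_wrong = [1] * 100   # every majority sample wrong, minority all right
all_minority_wrong = [0] * 100   # every minority sample wrong, majority all right

print(weighted_error(y_true, all_majority_wrong, minority_weight=4))  # 80.0
print(weighted_error(y_true, all_minority_wrong, minority_weight=4))  # 80.0
print(weighted_error(y_true, all_minority_wrong, minority_weight=1))  # 20.0
```

With a weight of 4, getting the whole minority class wrong costs as much as getting the whole majority class wrong; without it, ignoring the minority class is four times cheaper.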

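In general, it is a good idea to set class weights inversely proportional to the number of samples in each class. In your case, that gives 4. In practice, however, you should probably try several different values to find the best weight.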

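Another important aspect is the ratio of these classes in the wild. You said you downsampled; if the class ratio in the wild differs from the one in your training dataset, you may observe worse scores when you deploy the model or test it on unseen samples. That is why, ideally, you should also use your domain knowledge to split the validation and test sets at the real-world ratio.

XGBoost includes some hyperparameters to help us get there.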

For scale_pos_weight, a typical value to consider is:

sum(negative instances) / sum(positive instances)

For extremely imbalanced datasets, some people suggest using the sqrt of the formula above.
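To give a feel for how much the square root softens the weight, here is a small numeric sketch. The counts (9900 negatives, 100 positives) are hypothetical and chosen only to represent a 99:1 imbalance.

```python
from math import sqrt

n_negative, n_positive = 9900, 100   # hypothetical counts, 99:1 imbalance

full_ratio = n_negative / n_positive   # the typical value from the formula
damped_ratio = sqrt(full_ratio)        # the softer alternative some people suggest

print(full_ratio)    # 99.0
print(damped_ratio)  # just under 10
```

Which of the two works better is an empirical question, so both are reasonable candidates to try.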

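For weights, the usual route in XGBoost is the sample_weight argument, which you can derive from class_weights (the code below uses scikit-learn's class_weight utilities for this).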

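The difference between the two, in summary:

The sample_weight parameter lets you specify a different weight for each training example. The scale_pos_weight parameter lets you provide a single weight for an entire class of examples (the "positive" class).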
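The link between the two can be sketched in plain Python. The helper balanced_sample_weights is hypothetical and written for this illustration; it follows the "balanced" heuristic n_samples / (n_classes * count(c)) that scikit-learn documents for its class weight utilities, and the 80/20 labels are toy data.

```python
from collections import Counter

def balanced_sample_weights(y):
    """Per-sample weights using the 'balanced' heuristic:
    weight(class c) = n_samples / (n_classes * count(c))."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    class_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
    return [class_weights[label] for label in y], class_weights

y = [0] * 80 + [1] * 20
weights, class_weights = balanced_sample_weights(y)

print(class_weights)  # {0: 0.625, 1: 2.5}
# The positive-to-negative weight ratio equals count(negative) / count(positive),
# which is the usual scale_pos_weight value
print(class_weights[1] / class_weights[0])  # 4.0
```

In the binary case, balanced per-sample weights and scale_pos_weight therefore express the same relative emphasis on the positive class; sample_weight is the more general tool because each row can carry its own value.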

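In code, you can see these implemented below, including the square root. Note that I had to use synthetic data, because the question did not provide any.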

# General imports
import pandas as pd
from sklearn import datasets
from collections import Counter

# Generate datasets
from sklearn.datasets import make_classification
from imblearn.datasets import make_imbalance

# Train, test, splits and gridsearch optimization
from sklearn.model_selection import train_test_split, GridSearchCV

# Class weights
from sklearn.utils import class_weight

# Performance
from sklearn.metrics import classification_report

# Modeling
import xgboost

import warnings
warnings.filterwarnings('ignore')

# Generate synthetic data
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, class_sep=2.0, n_classes=2, n_clusters_per_class=5, hypercube=True, random_state=30)
scaled_X, scaled_y = make_imbalance(X, y, sampling_strategy={0:200}, random_state=8)
data = pd.DataFrame(data=scaled_X, columns=['feature_{}'.format(i) for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(data, scaled_y, random_state=8, stratify=scaled_y)

# Compare 4 XGBoost models: no changes to weights, using sample weights, using scale_pos_weight, and using its square root

# Build a model without using the scale_pos_weight parameter, fit it, and get a set of its performance measures.
model_no_scale = xgboost.XGBClassifier(random_state=30)
model_no_scale.fit(X_train, y_train)
# Print performance
print("Off the Shelf XGBoost")
print(classification_report(y_test, model_no_scale.predict(X_test)))

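# Get per-sample weights from balanced class weights, computed on the training labels only
# https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost
# Note: sample_weight is an argument of fit(), not of the XGBClassifier constructor
train_weights = class_weight.compute_sample_weight(class_weight='balanced', y=y_train)
model_weights = xgboost.XGBClassifier(random_state=30)
model_weights.fit(X_train, y_train, sample_weight=train_weights)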
# Print performance
print("Weights XGBoost")
print(classification_report(y_test, model_weights.predict(X_test)))

# Get the counts of the training data per XGBoost documentation
counts = Counter(y_train)
model_scale = xgboost.XGBClassifier(scale_pos_weight=counts[0] / counts[1], random_state=30)
model_scale.fit(X_train, y_train)
# Print performance
print("Scale XGBoost")
print(classification_report(y_test, model_scale.predict(X_test)))

# Use the square root of the same training-set ratio for scale_pos_weight
from math import sqrt
model_sqrt = xgboost.XGBClassifier(scale_pos_weight=sqrt(counts[0] / counts[1]), random_state=30)
model_sqrt.fit(X_train, y_train)
# Print performance
print("SQRT XGBoost")
print(classification_report(y_test, model_sqrt.predict(X_test)))

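Results (note that the "Weights XGBoost" report is identical to the off-the-shelf one, which indicates the sample weights had no effect in the run that produced it):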

Off the Shelf XGBoost
              precision    recall  f1-score   support

           0       0.95      0.38      0.54        50
           1       0.98      1.00      0.99      1253

    accuracy                           0.98      1303
   macro avg       0.96      0.69      0.77      1303
weighted avg       0.97      0.98      0.97      1303


Weights XGBoost
              precision    recall  f1-score   support

           0       0.95      0.38      0.54        50
           1       0.98      1.00      0.99      1253

    accuracy                           0.98      1303
   macro avg       0.96      0.69      0.77      1303
weighted avg       0.97      0.98      0.97      1303

Scale XGBoost
              precision    recall  f1-score   support

           0       0.73      0.64      0.68        50
           1       0.99      0.99      0.99      1253

    accuracy                           0.98      1303
   macro avg       0.86      0.82      0.83      1303
weighted avg       0.98      0.98      0.98      1303

SQRT XGBoost
              precision    recall  f1-score   support

           0       0.96      0.46      0.62        50
           1       0.98      1.00      0.99      1253

    accuracy                           0.98      1303
   macro avg       0.97      0.73      0.81      1303
weighted avg       0.98      0.98      0.97      1303