Python: How do I apply oversampling when doing cross-validation?


python machine-learning scikit-learn


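I am working on an imbalanced dataset for classification. Previously I oversampled the training data with the Synthetic Minority Over-sampling Technique (SMOTE). This time, however, I think I also need Leave-One-Group-Out (LOGO) cross-validation, because I want to leave out one subject in each CV iteration.

I'm not sure I can explain it well, but as far as I understand, to do k-fold CV with SMOTE we can apply SMOTE inside the loop on each fold, as I saw in this code. Below is an example of SMOTE applied in k-fold CV: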

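from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    # Oversample the training fold only; the test fold keeps its original distribution
    sm = SMOTE()
    # fit_resample is the current name of the older fit_sample method
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # classification model example
    # Fit on the oversampled data (fitting on X_train, y_train would ignore SMOTE)
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')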

Without SMOTE, I tried the following to do LOGO CV. But this way I would be training on a highly imbalanced dataset.

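import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# X, X_std (the feature arrays), df and model are defined earlier and not shown here
y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values  # all rows from the same cow must be left out together
logo = LeaveOneGroupOut()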

logo.get_n_splits(X_std, y, groups)

cv=logo.split(X_std, y, groups)

scores=[]
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    scores.append(model.score(X_test, y_test.ravel()))

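How should I implement SMOTE inside the Leave-One-Group-Out CV loop? I am confused about how to define the group list for the synthetic training data.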

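The approach suggested here makes more sense for leave-one-out cross-validation. Leave one group out as the test set and oversample the remaining groups. Train the classifier on all of the oversampled data and test it on the held-out set. Because the split is computed on the original rows before SMOTE runs, the synthetic samples never need a group label.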
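To make the role of the group labels concrete, here is a minimal sketch on hypothetical toy data (`X_demo`, `y_demo` and `groups_demo` are made up for illustration and are not the asker's data). It only checks that each LOGO test fold holds exactly one group and that this group is absent from the training rows, which is why the group array is needed for splitting and nowhere else.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy data: 12 rows from 3 subjects (group ids 0, 1, 2)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 2))
y_demo = np.array([0, 0, 0, 1] * 3)
groups_demo = np.repeat([0, 1, 2], 4)

logo = LeaveOneGroupOut()
held_out = []
for train_idx, test_idx in logo.split(X_demo, y_demo, groups_demo):
    # Groups present in the test fold (should be exactly one)
    test_groups = np.unique(groups_demo[test_idx])
    # Groups shared between training rows and the test fold (should be none)
    overlap = np.intersect1d(groups_demo[train_idx], test_groups)
    held_out.append((test_groups.tolist(), int(overlap.size)))

print(held_out)
```

Any oversampling applied to `X_demo[train_idx]` after this point only adds rows to the training side, so it cannot leak into the held-out group.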

In your case, the following code would be the correct way to implement SMOTE inside the LOGO CV loop:

from imblearn.over_sampling import SMOTE

for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    # Oversample the training groups only; the held-out group stays untouched
    sm = SMOTE()
    # fit_resample is the current name of the older fit_sample method
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train.ravel())
    model.fit(X_train_oversampled, y_train_oversampled)
    scores.append(model.score(X_test, y_test.ravel()))