Implementing SMOTEENN in Dask RandomizedSearchCV


I successfully implemented a model using SMOTEENN and a random forest (RF) in a pipeline, like this:

import random
from multiprocessing import cpu_count  # cpu_count() is used for n_jobs below
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline
After loading the data and obtaining the X_train, X_test, y_train, and y_test matrices, I successfully ran an sklearn randomized search as follows:

seed = 1706
knn = 10
smoted = SMOTE(sampling_strategy = 'auto',
               k_neighbors = knn,
               random_state = seed) 
mydata = pd.read_csv(datapath)
params_rf = {
  'rf__max_depth':[8, 14, 20, 26],
  'rf__min_samples_leaf':[8, 15, 22, 29],
  'rf__max_features':[6, 12, 18, 24, 30],
  'rf__n_estimators':[400, 800]
  }

smote_enn = SMOTEENN(smote = smoted)

rf = RandomForestClassifier(criterion = 'gini')

pipeline = Pipeline([('smote_enn', smote_enn), ('rf', rf)]) #<-pipeline with smote and model steps

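random.seed(seed)  # seeds only Python's random module; pass random_state to RandomizedSearchCV for a reproducible search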
grid_rf = RandomizedSearchCV(estimator = pipeline,
                             param_distributions = params_rf,
                             scoring = 'roc_auc',
                             cv = 8,
                             n_jobs = cpu_count()-2,
                             refit = True,
                             return_train_score = False,
                             n_iter = 80)
grid_rf.fit(X_train, y_train.values.ravel())
The reason it doesn't work is that dask-ml uses sklearn's Pipeline, which does not handle fit_resample and does not pass the transformed y along the pipeline.

I have opened a PR against dask-ml to handle imblearn components. You can use it as a temporary workaround until the PR is accepted.
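
To make the explanation concrete, here is a minimal toy sketch of the two pipeline behaviours. The classes and functions (ToySampler, ToyModel, fit_transform_only, fit_sampler_aware) are hypothetical stand-ins written for illustration only; they are not the actual internals of sklearn, imblearn, or dask-ml. The sketch assumes only what the answer states: one kind of pipeline calls transform on every intermediate step and never updates y, while a sampler-aware pipeline calls fit_resample and carries the new y forward.

```python
class ToySampler:
    """Stand-in for a resampler such as SMOTEENN: fit and fit_resample, no transform."""

    def fit(self, X, y):
        return self

    def fit_resample(self, X, y):
        # Duplicate the minority-class rows (label 1), so the row count changes.
        extra = [(x, label) for x, label in zip(X, y) if label == 1]
        X_res = list(X) + [x for x, _ in extra]
        y_res = list(y) + [label for _, label in extra]
        return X_res, y_res


class ToyModel:
    """Stand-in final estimator that records the training data it received."""

    def fit(self, X, y):
        self.n_rows_ = len(X)
        self.labels_ = list(y)
        return self


def fit_transform_only(steps, X, y):
    """Pipeline logic that expects fit/transform on every intermediate step."""
    for step in steps[:-1]:
        X = step.fit(X, y).transform(X)  # y is never updated here
    return steps[-1].fit(X, y)


def fit_sampler_aware(steps, X, y):
    """Pipeline logic that also understands fit_resample."""
    for step in steps[:-1]:
        if hasattr(step, "fit_resample"):
            X, y = step.fit_resample(X, y)  # both X and y are replaced
        else:
            X = step.fit(X, y).transform(X)
    return steps[-1].fit(X, y)


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 0, 1]
steps = [ToySampler(), ToyModel()]

try:
    fit_transform_only(steps, X, y)
except AttributeError as err:
    print("transform-only pipeline failed:", err)

model = fit_sampler_aware(steps, X, y)
print(model.n_rows_, model.labels_)
```

The first call fails for the same kind of reason as the error reported in the question: the sampler has no transform method. The second call succeeds because the resampled y travels with the resampled X to the final estimator.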

I ran into the same problem using Dask's RandomizedSearchCV. Apparently, Dask requires every component to implement a transform() method, whereas sklearn's RandomizedSearchCV does not. I will try to find a way around this.
Swapping in dask-ml's RandomizedSearchCV with the same pipeline and parameter grid:

from dask_ml.model_selection import RandomizedSearchCV as DaskRandomGridSearchCV
grid_rf = DaskRandomGridSearchCV(estimator = pipeline,
                                 param_distributions = params_rf,
                                 scoring = 'roc_auc',
                                 cv = 8,
                                 # n_jobs = cpu_count()-2,  <- not needed because of dask
                                 refit = True,
                                 return_train_score = False,
                                 n_iter = 80)
grid_rf.fit(X_train, y_train.values.ravel())

fails with:

AttributeError: 'SMOTEENN' object has no attribute 'transform'
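
One possible way around this error, offered as a sketch rather than something taken from the thread, is to hide the resampling inside a single estimator so that the search never sees a pipeline step without transform. ResampledClassifier below is a hypothetical name. It assumes the sampler exposes fit_resample and follows sklearn's estimator conventions (as imblearn samplers do). Whether dask-ml's search accepts it in a given version is an assumption that has not been verified here.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class ResampledClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: resample the training data, then fit a classifier."""

    def __init__(self, sampler=None, classifier=None):
        self.sampler = sampler
        self.classifier = classifier

    def fit(self, X, y):
        # Resampling happens only inside fit, so each CV training fold is
        # resampled on its own and validation folds are scored untouched.
        X_res, y_res = clone(self.sampler).fit_resample(X, y)
        self.classifier_ = clone(self.classifier).fit(X_res, y_res)
        self.classes_ = self.classifier_.classes_
        return self

    def predict(self, X):
        return self.classifier_.predict(X)

    def predict_proba(self, X):
        # Needed for probability-based scorers such as 'roc_auc'.
        return self.classifier_.predict_proba(X)
```

With such a wrapper, the estimator passed to the search would be ResampledClassifier(sampler=smote_enn, classifier=rf), and the parameter keys would change from the rf__ prefix to classifier__, for example 'classifier__max_depth'.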