
Python: How to use a predefined split with RandomizedSearchCV

python machine-learning scikit-learn


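I am trying to regularize my random forest regressor with RandomizedSearchCV. With RandomizedSearchCV the train and test sets are not specified explicitly, and I need to be able to specify my train/test sets so that I can preprocess them after the split. I then found a couple of related posts, but I still don't know how to do it, because in my case I am using cross-validation. I have tried appending my train and test sets from the cross-validation, but it doesn't work. It raises ValueError: could not broadcast input array from shape (1824,9) into shape (1824), which refers to my X_test.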

x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'], axis=1))
y = np.array(avo_sales.TotalBags)

kf = KFold(n_splits=10)

for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()

imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
X_test[:,1:8] = imp.transform(X_test[:,1:8])

le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.transform(X_test[:,8])

train_indices = X_train, y_test
test_indices = X_test, y_test
my_test_fold = np.append(train_indices, test_indices)
pds = PredefinedSplit(test_fold=my_test_fold)

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rfr = RandomForestRegressor()
rfr_random = RandomizedSearchCV(estimator = rfr , 
                               param_distributions = random_grid,
                               n_iter = 100,
                               cv = pds, verbose=2, random_state=42, n_jobs = -1)  # I'll be filling the cv parameter with the predefined split
rfr_random.fit(X_train, y_train)

I think your best bet is to use a Pipeline plus a ColumnTransformer. A pipeline lets you specify several computation steps, including pre- and post-processing, and a column transformer applies different transformations to different columns. In your case, that would be something like:
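from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

pipeline = make_pipeline(
    make_column_transformer(
        # columns 1-7 (numeric): impute with the median
        (SimpleImputer(strategy='median'), list(range(1, 8))),
        # column 8 (categorical): impute with the mode, then integer-encode;
        # OrdinalEncoder stands in for LabelEncoder, which only handles 1-D targets
        (make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OrdinalEncoder(),
        ), [8]),
        remainder='passthrough',  # keep the untouched column 0
    ),
    RandomForestRegressor(),
)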

You then use this model as a normal estimator, with the usual fit and predict API. In particular, you can use it for the randomized search:
rfr_random = RandomizedSearchCV(estimator = pipeline, ...)

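Now the preprocessing steps are applied to each split before the random forest is fitted.
This will probably need further tweaking for your data, but hopefully you get the idea.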
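To make the idea concrete, here is a minimal end-to-end sketch under stated assumptions: the data is synthetic (three numeric columns and one categorical column standing in for the avocado table), and the column indices and the small parameter lists are made up for illustration. One detail that is easy to miss is that `make_pipeline` names each step after its lowercased class name, so the forest's hyperparameters have to be prefixed with `randomforestregressor__` in the search space.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in data (assumption): 3 numeric columns + 1 categorical column.
rng = np.random.default_rng(0)
num = rng.normal(size=(60, 3))
num[rng.random(num.shape) < 0.1] = np.nan  # sprinkle in missing values
cat = rng.choice(["conventional", "organic"], size=(60, 1))
X = np.hstack([num.astype(object), cat.astype(object)])
y = rng.normal(size=60)

pipeline = make_pipeline(
    make_column_transformer(
        (SimpleImputer(strategy="median"), [0, 1, 2]),
        (make_pipeline(SimpleImputer(strategy="most_frequent"),
                       OrdinalEncoder()), [3]),
    ),
    RandomForestRegressor(random_state=0),
)

# Hyperparameters are addressed as <step name>__<parameter name>.
param_distributions = {
    "randomforestregressor__n_estimators": [10, 20, 50],
    "randomforestregressor__max_depth": [3, 5, None],
    "randomforestregressor__min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
# For every candidate and every fold, the imputers and the encoder are
# fitted on the training part of the fold only, then applied to the test part.
search.fit(X, y)
print(search.best_params_)
```

Because the preprocessing lives inside the estimator, there is no need to get hold of the individual train/test arrays: the search refits the whole pipeline on each training fold.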
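First, your for train_index, test_index in kf.split(x): loop is pointless, because you overwrite the folds on every iteration. Add a print inside the loop to get a better idea of what you are doing. Second, for your problem, use cv=kf and you will achieve your goal. Fix the random seed for reproducibility.

Hi, thanks for your reply. But if I remove the for train_index, test_index in kf.split(x): loop, I can't preprocess the train/test sets, which has to happen after they have been split. I need to specify my train/test sets explicitly so that I can access them for preprocessing.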
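For reference, `PredefinedSplit` does not take the data itself. Its `test_fold` argument is a 1-D integer array with one entry per sample, giving the index of the fold in which that sample is used for testing (`-1` keeps a sample in the training set of every split). The question instead appends the feature and target arrays, which is not what the class expects. The following small sketch, on toy data that is purely illustrative, builds such an array from a `KFold` object and checks that the result reproduces the same splits, which is why passing `cv=kf` directly is enough here.

```python
import numpy as np
from sklearn.model_selection import KFold, PredefinedSplit

X = np.arange(20).reshape(10, 2)  # toy data: 10 samples, 2 features

# test_fold[i] = index of the fold in which sample i is a test sample.
test_fold = np.empty(len(X), dtype=int)
kf = KFold(n_splits=5)
for fold, (_, test_index) in enumerate(kf.split(X)):
    test_fold[test_index] = fold

pds = PredefinedSplit(test_fold=test_fold)

# The predefined split yields exactly the same train/test indices as kf.
for (tr_a, te_a), (tr_b, te_b) in zip(kf.split(X), pds.split()):
    assert np.array_equal(tr_a, tr_b)
    assert np.array_equal(te_a, te_b)
```

A hand-built `test_fold` is mainly useful when the folds are fixed in advance, for example by date or by group, rather than generated by a splitter.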