Scikit-learn ValueError: feature_names mismatch when using XGBoost inside RandomizedSearchCV, but not when using XGBoost on its own
I am using RandomizedSearchCV on xgboost, with early stopping, inside an imblearn pipeline. When I run the code with RandomizedSearchCV, I get the following error:
File "/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 2131, in _validate_features
data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2'...
Here is the code:
classifier = XGBClassifier(
    n_estimators=200,
    bootstrap=True,
    objective='binary:logistic',
    random_state=0,
    verbosity=1
)

pipeline = Pipeline([
    ('union', FeatureUnion(  # Feature union vertically merges text, numeric, and categorical data for model ingestion
        transformer_list=[
            ('categorical', Pipeline([
                ('selector', ItemSelector(key=['region'])),
                ('onehotencoder', encoder)
            ])),
            ('bow', Pipeline([
                ('selector', ItemSelector(key='message')),
                ('clean', CleanText()),
                ('tfidf', tfidf_vectorizer)
            ])),
            ('text_stats', Pipeline([
                ('selector', ItemSelector(key='message')),
                ('stats', TextStats()),
                ('vect', DictVectorizer())
            ]))
        ]
    )),
    ('model', classifier)  # Modeling step
], verbose=2)

# Fit every step except the model, then transform the test set for early stopping
pipeline_temp = Pipeline(pipeline.steps[:-1])
pipeline_temp.fit(X_train, y_train)
eval_set = [(pipeline_temp.transform(X_test), y_test)]

param_dist = {
    "model__max_depth": st.randint(32, 96),
    "model__learning_rate": st.uniform(0.05, 0.4),
    "model__subsample": st.beta(10, 1),
    "model__colsample_bytree": st.uniform(0.4, 0.8),
    "model__gamma": st.uniform(0, 5),
    "model__reg_alpha": st.expon(0, 50),
    "model__min_child_weight": [1, 3, 5, 9]
}

searcher = RandomizedSearchCV(estimator=pipeline,
                              param_distributions=param_dist,
                              n_iter=10,
                              cv=2,
                              refit=False,
                              verbose=3,
                              n_jobs=-1,
                              error_score='raise')

searcher.fit(X_train, y_train,
             model__early_stopping_rounds=20,
             model__eval_metric="logloss",
             model__eval_set=eval_set,
             model__verbose=1)
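If I change the code to call pipeline.fit directly, without RandomizedSearchCV, it runs without error. All the GitHub issues and Stack Overflow posts I found seem to point to an xgboost DMatrix problem, but that seems odd given that the same code runs fine outside RandomizedSearchCV. I set error_score='raise' because I noticed that every test score in cv_results_ was showing up as NaN.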
Versions:
I tested this in two development environments:
Python 3.6.10 (same problem with 3.7.7)
sklearn 0.23.1
xgboost 1.1.1 (same problem with 0.90)
imblearn 0.6.2 (same problem with 0.7.0)
If I run this without early stopping, it also runs without error and reports test scores in cv_results_, so the problem appears to be specific to early stopping.
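One way to probe the early-stopping angle, offered as an assumption rather than a confirmed diagnosis: the eval_set is transformed by a feature pipeline fitted once on all of X_train, while RandomizedSearchCV refits the pipeline on each fold's subset. If a vectorizer learns a different vocabulary on a fold, the model's training matrix and the eval_set could end up with different column counts. The sketch below uses only the standard library; fit_vocabulary and transform are hypothetical stand-ins for a text vectorizer, and the documents are invented toy data.

```python
# Hypothetical stand-in for a fitted text vectorizer: the "vocabulary" is the
# sorted set of tokens seen during fit, with one output column per token.
def fit_vocabulary(docs):
    return sorted({token for doc in docs for token in doc.lower().split()})


def transform(docs, vocabulary):
    # One row per document, one count column per vocabulary entry.
    return [[doc.lower().split().count(tok) for tok in vocabulary] for doc in docs]


# Toy data, invented for illustration only.
train_docs = [
    "flood in the north",
    "fire near the river",
    "need water and food",
    "road blocked by flood",
]
test_docs = ["flood near the road"]

# Fitted once on the full training set, as pipeline_temp is in the question.
full_vocab = fit_vocabulary(train_docs)
eval_matrix = transform(test_docs, full_vocab)

# Fitted again on a single cross-validation fold, as the search would do internally.
fold_docs = train_docs[:2]
fold_vocab = fit_vocabulary(fold_docs)
fold_matrix = transform(fold_docs, fold_vocab)

n_eval_features = len(eval_matrix[0])
n_fold_features = len(fold_matrix[0])
print("eval_set columns:", n_eval_features)
print("fold training columns:", n_fold_features)
```

If the two counts differ in the real pipeline as well, comparing the shape of pipeline_temp.transform(X_test) with the shape produced by the same steps fitted on a single fold would be a quick check of whether this is what triggers the mismatch.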