scikit-learn ValueError: feature_names mismatch when using XGBoost inside RandomizedSearchCV, but not when using XGBoost on its own


I am using RandomizedSearchCV on xgboost, with early stopping, inside an imblearn pipeline. If I run the code with RandomizedSearchCV, I get the following error:

File "/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 2131, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0', 'f1', 'f2'...
Here is the code:

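# Imports inferred from the names used below (not shown in the original post).
# ItemSelector, CleanText, TextStats, encoder and tfidf_vectorizer are defined elsewhere.
import scipy.stats as st
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import FeatureUnion
from xgboost import XGBClassifier

classifier = XGBClassifier(
                    n_estimators=200,
                    bootstrap=True,  # does not appear to be a documented XGBoost parameter
                    objective = 'binary:logistic',
                    random_state=0,
                    verbosity=1
                    )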

pipeline = Pipeline([
        
            ('union', FeatureUnion(  #Feature union vertically merges text, numeric, and categorical data for model ingestion
                transformer_list=[
                    ('categorical',Pipeline([ 
                        ('selector', ItemSelector(key=['region'])),
                        ('onehotencoder',encoder)
                    ])),
                    ('bow', Pipeline([ 
                        ('selector', ItemSelector(key='message')),
                        ('clean',CleanText()),
                        ('tfidf', tfidf_vectorizer)
                    ])),
                    ('text_stats', Pipeline([ 
                        ('selector', ItemSelector(key='message')),
                        ('stats', TextStats()),  
                        ('vect', DictVectorizer()) 
                    ]))
                ]
            )),
        
            ('model', classifier) # Modeling step
        ],verbose=2)
        
# Fit only the preprocessing steps on the full training set so that the
# evaluation set for early stopping can be transformed up front.
pipeline_temp = Pipeline(pipeline.steps[:-1])
pipeline_temp.fit(X_train, y_train)
eval_set = [(pipeline_temp.transform(X_test), y_test)]

# Note: scipy.stats.uniform(loc, scale) samples from [loc, loc + scale].
param_dist = {
    "model__max_depth": st.randint(32, 96),
    "model__learning_rate": st.uniform(0.05, 0.4),
    "model__subsample": st.beta(10, 1),
    "model__colsample_bytree": st.uniform(0.4, 0.8),
    "model__gamma": st.uniform(0, 5),
    "model__reg_alpha": st.expon(0, 50),
    "model__min_child_weight": [1, 3, 5, 9]
}
searcher = RandomizedSearchCV(estimator=pipeline,
                              param_distributions=param_dist,
                              n_iter=10,
                              cv=2,
                              refit=False,
                              verbose=3,
                              n_jobs=-1,
                              error_score='raise')

# Fit parameters prefixed with model__ are forwarded to the XGBClassifier step.
searcher.fit(X_train, y_train,
             model__early_stopping_rounds=20,
             model__eval_metric="logloss",
             model__eval_set=eval_set,
             model__verbose=1)
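If I change the code to call pipeline.fit without RandomizedSearchCV, it runs without error. All the GitHub issues and Stack Overflow posts I found seem to point to an xgboost problem with DMatrix, but that seems odd given that the same code runs fine outside of RandomizedSearchCV. I set the parameter error_score='raise' because I noticed that every test score value in cv_results_ was showing up as NaN.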

Versions:

I tested this in two development environments.

Python 3.6.10 (same issue with 3.7.7)

sklearn 0.23.1

xgboost 1.1.1 (same issue with 0.90)

imblearn 0.6.2 (same issue with 0.7.0)

If I run this without early stopping, it also runs without error and reports test scores in cv_results_, so the problem appears to be confined to early stopping.
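One possible explanation, offered as an assumption and not something confirmed in the post: eval_set is transformed by preprocessing that was fit on all of X_train, whereas RandomizedSearchCV re-fits a clone of the whole pipeline on each training split. Steps such as a TF-IDF vectorizer or DictVectorizer learn their columns from the rows they see, so a split may produce a different number of columns than the pre-transformed eval_set. The pure-Python sketch below illustrates that mechanism only; fit_vocabulary and transform are hypothetical stand-ins for a fitted vectorizer, and the sample strings are made up.

```python
# Hypothetical stand-in for a text vectorizer: the learned vocabulary, and
# therefore the number of output columns, depends on the rows it was fit on.
def fit_vocabulary(docs):
    """Return the sorted distinct tokens seen in docs."""
    return sorted({token for doc in docs for token in doc.lower().split()})


def transform(docs, vocabulary):
    """Encode each doc as token counts, one column per vocabulary entry."""
    return [[doc.lower().split().count(tok) for tok in vocabulary] for doc in docs]


# Made-up data for illustration only.
X_train = [
    "flood in the north",
    "power outage downtown",
    "road closed",
    "flood warning lifted",
]
X_test = ["flood downtown"]

# As in the question: preprocessing fit once on all of X_train, then used
# to transform the evaluation set ahead of time.
full_vocab = fit_vocabulary(X_train)
eval_matrix = transform(X_test, full_vocab)

# Inside cross-validation: preprocessing is re-fit on only part of X_train.
fold_vocab = fit_vocabulary(X_train[:2])
fold_matrix = transform(X_train[:2], fold_vocab)

n_eval_cols = len(eval_matrix[0])
n_fold_cols = len(fold_matrix[0])
print("eval_set columns:", n_eval_cols)
print("fold training columns:", n_fold_cols)
```

If this is what is happening, it would also be consistent with pipeline.fit working on its own, since in that case both the pipeline and pipeline_temp are fit on the same full X_train.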