Python scikit-learn SequentialFeatureSelector: "Input contains NaN, infinity or a value too large for dtype('float64')", even with a pipeline

I am trying to use SequentialFeatureSelector. For the estimator argument I pass a pipeline that includes a step to impute missing values:

model = Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('imputing',
                                                                   SimpleImputer(fill_value=-1,
                                                                                 strategy='constant')),
                                                                  ('preprocessing',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1300013d0>),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('imputing',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoding',
                                                                   OrdinalEncoder(handle_unknown='ignore'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1300015b0>)])),
                ('model',
                 LGBMClassifier(class_weight='balanced', random_state=1,
                                reg_lambda=0.1))])
Edit:

Reproducible example:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

X, y = load_iris(return_X_y=True)
# Inject missing values (np.nan; the np.NaN alias was removed in NumPy 2.0)
X[:10, 0] = np.nan

clf = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                ("model", LogisticRegression(random_state=1))])

SequentialFeatureSelector(estimator=clf,
                          scoring="accuracy",
                          cv=3).fit(X, y)

It raises the same error, even though clf itself can be fitted without any problem.

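The scikit-learn documentation does not say that SequentialFeatureSelector can be used with Pipeline objects; it only says that the class accepts an unfitted estimator. Given that, you can take the classifier out of the pipeline, preprocess X, and then pass it together with the unfitted classifier for feature selection, as in the example below: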

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler


X, y = load_iris(return_X_y=True)
# Inject missing values (np.nan; the np.NaN alias was removed in NumPy 2.0)
X[:10, 0] = np.nan

# Preprocessing only: impute missing values, then scale
pipe = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                 ("scaler", MaxAbsScaler())])

# Preprocess your data
X = pipe.fit_transform(X)

# Run the SequentialFeatureSelector with an unfitted classifier
sfs = SequentialFeatureSelector(estimator=LogisticRegression(),
                                scoring="accuracy",
                                cv=3).fit(X, y)

# Check which features were selected (boolean mask) and transform X
print(sfs.get_support())
X = sfs.transform(X)

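Preprocessing this way biases your model, because the whole dataset is preprocessed before it is split into CV folds.

The sequential feature selector is used to select features according to their importance, not to make predictions.

But unlike other sklearn feature-selection wrappers (such as RFE), SFS computes importance from the difference in score between the models with and without a feature. So yes, it does make predictions. Quoting the documentation: "This Sequential Feature Selector adds (forward selection) or removes (backward selection) features to form a feature subset in a greedy fashion. At each stage, this estimator chooses the best feature to add or remove based on the cross-validation score of an estimator."

That is different from having a dataset, splitting it into a training and a test set, and running the whole prediction workflow. The OP is free to just look at which features matter for LR, or to transform the dataset as well.

Alternatively, you can use SequentialFeatureSelector from the mlxtend package: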
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import numpy as np

X, y = load_iris(return_X_y=True)
# Inject missing values (np.nan; the np.NaN alias was removed in NumPy 2.0)
X[:10, 0] = np.nan

# The imputer stays inside the pipeline that is passed as the estimator
clf = Pipeline([
    ("preprocessing", SimpleImputer(missing_values=np.nan)),
    ("model", LogisticRegression(random_state=1))
])

sfs = SequentialFeatureSelector(estimator=clf,
                                forward=True,
                                k_features="best",
                                scoring="accuracy",
                                cv=3, n_jobs=-1).fit(X, y)
print(sfs.k_feature_idx_)

# Output: (0, 1, 2, 3)
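
To make the procedure quoted from the documentation concrete, here is a minimal sketch of greedy forward selection written by hand with cross_val_score. It is not the scikit-learn or mlxtend implementation; the helper name greedy_forward_selection and its parameters are made up for illustration. Because the imputer lives inside the pipeline that is scored, it is refit on the training part of each fold, which avoids the bias raised in the comments about preprocessing before the CV split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


def greedy_forward_selection(estimator, X, y, n_features_to_select, cv=3):
    """Hypothetical helper: add one feature at a time, keeping the one
    that gives the best mean cross-validated accuracy."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features_to_select:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            # The whole pipeline (imputer + model) is cloned and refit per fold
            scores[j] = cross_val_score(
                estimator, X[:, cols], y, cv=cv, scoring="accuracy"
            ).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # same missing values as in the question

clf = Pipeline([
    ("preprocessing", SimpleImputer(missing_values=np.nan)),
    ("model", LogisticRegression(max_iter=1000, random_state=1)),
])

# n_features_to_select=2 is an arbitrary choice for this sketch
selected = greedy_forward_selection(clf, X, y, n_features_to_select=2)
print(selected)
```

This refits the pipeline once per candidate feature per fold, so it is only practical for small feature counts; it is meant to show the mechanics, not to replace a library selector.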