Python scikit-learn SequentialFeatureSelector: "Input contains NaN, infinity or a value too large for dtype('float64')", even with a pipeline

python scikit-learn

I am trying to use SequentialFeatureSelector, and for the estimator parameter I pass a pipeline that includes a step to impute missing values:

model = Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('imputing',
                                                                   SimpleImputer(fill_value=-1,
                                                                                 strategy='constant')),
                                                                  ('preprocessing',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1300013d0>),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('imputing',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('encoding',
                                                                   OrdinalEncoder(handle_unknown='ignore'))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1300015b0>)])),
                ('model',
                 LGBMClassifier(class_weight='balanced', random_state=1,
                                reg_lambda=0.1))])

Edit:

Reproducible example:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # introduce missing values in the first column

# The pipeline imputes the missing values before fitting the classifier
clf = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                ("model", LogisticRegression(random_state=1))])

SequentialFeatureSelector(estimator=clf,
                          scoring="accuracy",
                          cv=3).fit(X, y)

It raises the same error, even though clf on its own can be fitted without any problem.

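The scikit-learn documentation does not state that SequentialFeatureSelector works with Pipeline objects. It only states that the class accepts an unfitted estimator. Given that, you can remove the classifier from the pipeline, preprocess X, and then pass it together with an unfitted classifier for feature selection, as in the following example: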

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler


X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere

pipe = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                 ("scaler", MaxAbsScaler())])


# Preprocess your data
X = pipe.fit_transform(X)

# Run the SequentialFeatureSelector
sfs = SequentialFeatureSelector(estimator = LogisticRegression(),
                           scoring= "accuracy",
                           cv = 3).fit(X, y)

# Check which features are important and transform X
sfs.get_support()
X = sfs.transform(X)
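
One thing to keep in mind with this approach is that the imputer and scaler are fitted on all rows, including the rows that later act as validation folds. The following is a minimal sketch, not part of the original answer, that compares cross-validated accuracy when preprocessing is done once up front versus refitted inside each fold. The helper name make_preprocessor and the max_iter=1000 setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # same missing values as in the question


def make_preprocessor():
    """Hypothetical helper: imputation followed by scaling."""
    return Pipeline([("imputer", SimpleImputer()), ("scaler", MaxAbsScaler())])


# Variant 1: preprocess once on all rows, then cross-validate the classifier.
X_pre = make_preprocessor().fit_transform(X)
scores_pre = cross_val_score(LogisticRegression(max_iter=1000),
                             X_pre, y, cv=3, scoring="accuracy")

# Variant 2: keep preprocessing in the pipeline, so it is refitted
# on the training part of every fold and never sees the validation rows.
full = Pipeline([("prep", make_preprocessor()),
                 ("model", LogisticRegression(max_iter=1000))])
scores_in = cross_val_score(full, X, y, cv=3, scoring="accuracy")

print("preprocessed up front:", scores_pre.mean())
print("preprocessed per fold:", scores_in.mean())
```

On a small dataset with only ten imputed values the two numbers will probably be close; the gap tends to matter more when the preprocessing learns a lot from the data.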

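This way your model will be biased, because you are preprocessing the whole dataset before it is split into CV folds.

The sequential feature selector is used to select features based on their importance, not to make predictions.

But the difference from other sklearn feature selection wrappers (such as RFE) is that SFS computes importance from the difference in score between the model with and without a feature. So yes, it does make predictions. Quoting the docs: "This Sequential Feature Selector adds (forward selection) or removes (backward selection) features to form a feature subset in a greedy fashion. At each stage, this estimator chooses the best feature to add or remove based on the cross-validation score of an estimator."

That is different from having one dataset, splitting it into a training set and a test set, and running the whole prediction workflow. The OP is free to just check which features are important for the LR, or to also transform the dataset.

Alternatively, you can use the SequentialFeatureSelector from the mlxtend package, which accepts the pipeline as its estimator: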

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import numpy as np

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # introduce missing values in the first column

# Imputation stays inside the pipeline, so it is refitted within each CV fold
clf = Pipeline([
    ("preprocessing", SimpleImputer(missing_values=np.nan)),
    ("model", LogisticRegression(random_state=1))
])

sfs = SequentialFeatureSelector(estimator=clf,
                                forward=True,
                                k_features='best',
                                scoring="accuracy",
                                cv=3, n_jobs=-1).fit(X, y)
sfs.k_feature_idx_

>>> (0, 1, 2, 3)
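
The greedy procedure quoted from the documentation in the comments can also be sketched by hand with cross_val_score, which keeps the imputer inside each fold and uses only scikit-learn. This is an illustrative sketch under stated assumptions, not the library's implementation: the function name forward_select, its parameters, and the choice of n_features=2 are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


def forward_select(estimator, X, y, n_features, cv=3, scoring="accuracy"):
    """Hypothetical helper: greedy forward selection by mean CV score.

    Returns the chosen column indices in the order they were added.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        best_score, best_col = -np.inf, None
        for col in remaining:
            cols = selected + [col]
            # The estimator is cloned and refitted on each training fold,
            # so the imputer never sees the validation rows.
            score = cross_val_score(estimator, X[:, cols], y,
                                    cv=cv, scoring=scoring).mean()
            if score > best_score:
                best_score, best_col = score, col
        selected.append(best_col)
        remaining.remove(best_col)
    return selected


X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # same missing values as in the question

clf = Pipeline([
    ("preprocessing", SimpleImputer()),
    ("model", LogisticRegression(max_iter=1000, random_state=1)),
])

chosen = forward_select(clf, X, y, n_features=2)
print(chosen)
```

The loop refits the pipeline once per candidate feature and fold, so it scales poorly with many columns; for larger problems a maintained implementation is the better choice.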