Python scikit-learn SequentialFeatureSelector: "Input contains NaN, infinity or a value too large for dtype('float64')", even with a pipeline
I am trying to use SequentialFeatureSelector, and for the estimator parameter I am passing a pipeline that includes a step that imputes missing values:
model = Pipeline(steps=[
    ('preprocessing',
     ColumnTransformer(transformers=[
         ('pipeline-1',
          Pipeline(steps=[
              ('imputing',
               SimpleImputer(fill_value=-1, strategy='constant')),
              ('preprocessing', StandardScaler())]),
          <sklearn.compose._column_transformer.make_column_selector object at 0x1300013d0>),
         ('pipeline-2',
          Pipeline(steps=[
              ('imputing',
               SimpleImputer(fill_value='missing', strategy='constant')),
              ('encoding', OrdinalEncoder(handle_unknown='ignore'))]),
          <sklearn.compose._column_transformer.make_column_selector object at 0x1300015b0>)])),
    ('model',
     LGBMClassifier(class_weight='balanced', random_state=1,
                    reg_lambda=0.1))])
Edit:
Reproducible example:
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

# The pipeline imputes the missing values before fitting the classifier
clf = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                ("model", LogisticRegression(random_state=1))])

SequentialFeatureSelector(estimator=clf,
                          scoring="accuracy",
                          cv=3).fit(X, y)
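It shows the same error, even though clf on its own can be fitted without any problem.
The scikit-learn documentation does not state that SequentialFeatureSelector works with Pipeline objects; it only says that the class accepts an unfitted estimator. Given that, you can remove the classifier from the pipeline, preprocess X, and then pass it together with an unfitted classifier for feature selection, as in the example below: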
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

# Preprocessing only: imputation followed by scaling, no classifier
pipe = Pipeline([("preprocessing", SimpleImputer(missing_values=np.nan)),
                 ("scaler", MaxAbsScaler())])

# Preprocess your data
X = pipe.fit_transform(X)

# Run the SequentialFeatureSelector
sfs = SequentialFeatureSelector(estimator=LogisticRegression(),
                                scoring="accuracy",
                                cv=3).fit(X, y)

# Check which features are important and transform X
sfs.get_support()
X = sfs.transform(X)
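The example above fits the imputer and scaler on all of X before cross-validation. As a minimal sketch of one way to keep the preprocessing inside each fold, the loop below runs greedy forward selection by hand and scores each candidate subset with cross_val_score on the full pipeline. The helper forward_select and its parameters are hypothetical names used for illustration, not part of scikit-learn, and it selects a fixed number of features rather than reproducing every option of SequentialFeatureSelector.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


def forward_select(estimator, X, y, n_features, cv=3, scoring="accuracy"):
    """Greedy forward selection (hypothetical helper, not a scikit-learn API)."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            # cross_val_score refits the whole pipeline on each training fold,
            # so the imputer is fitted without seeing the held-out rows.
            scores[j] = cross_val_score(
                estimator, X[:, cols], y, cv=cv, scoring=scoring
            ).mean()
        best = max(scores, key=scores.get)  # feature giving the highest mean score
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

clf = Pipeline([("imputing", SimpleImputer()),
                ("model", LogisticRegression(random_state=1, max_iter=1000))])

chosen = forward_select(clf, X, y, n_features=2)
print(chosen)  # column indices of the selected features
```

Because the raw X (still containing NaN) is passed to cross_val_score and the pipeline handles the imputation, no preprocessing statistics are computed from the validation folds.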
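This way your model will be biased, because you are preprocessing the whole dataset before it is split into CV folds.
The sequential feature selector is meant for selecting features by importance, not for making predictions.
But unlike other sklearn feature-selection wrappers (such as RFE), SFS computes importance from the difference in score between models with and without a feature. So yes, it does make predictions. Quoting the documentation: "This Sequential Feature Selector adds (forward selection) or removes (backward selection) features to form a feature subset in a greedy fashion. At each stage, this estimator chooses the best feature to add or remove based on the cross-validation score of an estimator."
That is different from having a dataset, splitting it into a training and a test set, and running the whole prediction workflow. The OP is free to just look at which features matter for LR, or to also transform the dataset.
You can use SequentialFeatureSelector from the mlxtend package, passing the pipeline as the estimator.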
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import numpy as np

X, y = load_iris(return_X_y=True)
X[:10, 0] = np.nan  # inject missing values into the first feature

# The imputer stays inside the pipeline that is passed as the estimator
clf = Pipeline([
    ("preprocessing", SimpleImputer(missing_values=np.nan)),
    ("model", LogisticRegression(random_state=1))
])

sfs = SequentialFeatureSelector(estimator=clf,
                                forward=True,
                                k_features='best',
                                scoring="accuracy",
                                cv=3, n_jobs=-1).fit(X, y)

sfs.k_feature_idx_
# output reported in the answer: (0, 1, 2, 3)