Python: matrix (X) loses rows inside a custom sklearn transformer class
Given the code below, the custom transformer class I am trying to build (its purpose is to add some columns and run it through a grid search) works fine on its own, but when it is executed through a pipeline, the number of rows drops. Maybe someone can explain what is going wrong; I am clearly missing something. Search for the comment "What happends here, dimensionality reduced in rows?", which is where I print the problem. The full code being executed can be found below.
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn import linear_model
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import clone
from sklearn.base import TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
dict_breast_c = load_breast_cancer()
X = pd.DataFrame(dict_breast_c.data, columns=dict_breast_c.feature_names)
X.columns = [col.replace(" ", "_").replace("mean", "avg") for col in X.columns]
X_sub = X[[col for col in X.columns if col.find("avg") >= 0]]
X_feat_hldr = X[[col for col in X.columns if col not in (X_sub.columns)]]
y = pd.Series(dict_breast_c.target)
print ("Full X matrix shape: {}".format(X_sub.shape))
print ("Full feature holder shape: {}".format(X_feat_hldr.shape))
print ("Target vector: {}".format(y.shape))
class c_FeatureAdder(BaseEstimator, TransformerMixin):
    def __init__(self, X_feat_hldr, add_error_feat = True, add_worst_feat = True): # no *args or **kargs
        self.add_error_feat = add_error_feat
        self.add_worst_feat = add_worst_feat
        self.list_col_error = list_col_error
        self.list_col_wrst = list_col_wrst
        self.X_feat_hldr = X_feat_hldr
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X, y=None):
        if self.add_error_feat and not self.add_worst_feat:
            print ("Adding error (std) features:")
            return np.c_([X, self.X_feat_hldr[self.list_col_error]])
        elif not self.add_error_feat and self.add_worst_feat:
            print ("Adding worst features:")
            return np.c_([X, self.X_feat_hldr[self.list_col_wrst]])
        elif self.add_error_feat and self.add_worst_feat:
            # What happends here, dimensionality reduced in rows?
            print ("Adding error (std) features and worst features")
            print ("Feature: {}".format(self.list_col_error))
            print ("Feature: {}".format(self.list_col_wrst))
            print ("Something happends to number of rows! {}".format(X.shape))
            print (self.X_feat_hldr.shape)
            print (np.c_[X, self.X_feat_hldr[self.list_col_wrst].values, self.X_feat_hldr[self.list_col_error].values])
            return np.c_[X, self.X_feat_hldr[self.list_col_wrst].values, self.X_feat_hldr[self.list_col_error].values]
        else:
            print ("Adding no new features, passing indata:")
            return X
# Set a classifier, start with base form of logistic regression
clf_log_reg = linear_model.LogisticRegression(random_state=1234)
# Input into pipeline for doing feature adding to main data
list_col_error = [col for col in X_feat_hldr[0:2] if col.find("error") >= 0][0:1]
list_col_wrst = [col for col in X_feat_hldr[0:2] if col.find("worst") >= 0][0:2]
print (list_col_error)
print (list_col_wrst)
# Generate a pipeline of wanted transformers on data. End with classifier
pipe_log_reg = Pipeline(
    [('add_feat', c_FeatureAdder(X_feat_hldr))
    ,('clf', clf_log_reg)]
)
# Set the parameter grid to be checked for pipe above. Only thing being changed is the adding of features through c_FeatureAdder() class
param_grid = {
    'add_feat__add_error_feat' : [True, False]
    ,'add_feat__add_worst_feat' : [True, False]
    ,'clf__penalty' : ['l2', 'l1']
    ,'clf__C' : [1]
}
# Initialize GridSearch over parameter space
gs_lg_reg = GridSearchCV(
    estimator = pipe_log_reg
    ,param_grid = param_grid
    ,scoring = 'accuracy'
    ,n_jobs = 1
)
# Assign names
X_train = X_sub.values
y_train = y.values
print (X_train.shape)
# Fit data
gs_lg_reg_fit = gs_lg_reg.fit(X_train, y_train)
# Best estimator from GridSearch
gs_optimal_mdl_lg_reg = gs_lg_reg_fit.best_estimator_
You have a few errors.
You are using global variables inside the transformer:
self.list_col_error = list_col_error
self.list_col_wrst = list_col_wrst
If your transformer needs an input, it should take it as an argument of the constructor (__init__). Avoid relying on global variables inside the class.
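As a minimal sketch of that advice, the column lists can be passed to the constructor and stored unchanged. The class name FeatureAdderParams is hypothetical, the transform body is a placeholder, and the column names are simply the ones the question's code would select; only the parameter handling is illustrated here.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class FeatureAdderParams(BaseEstimator, TransformerMixin):
    """Hypothetical skeleton: only the constructor handling is shown."""

    def __init__(self, list_col_error=None, list_col_wrst=None,
                 add_error_feat=True, add_worst_feat=True):
        # Every argument is stored unchanged under its own name, which is
        # what BaseEstimator.get_params() and clone() rely on.
        self.list_col_error = list_col_error
        self.list_col_wrst = list_col_wrst
        self.add_error_feat = add_error_feat
        self.add_worst_feat = add_worst_feat

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X  # placeholder; the column logic is left out of this sketch


adder = FeatureAdderParams(list_col_error=["radius_error"],
                           list_col_wrst=["worst_radius", "worst_texture"])
params = adder.get_params()      # the column lists are now visible parameters
copy_of_adder = clone(adder)     # GridSearchCV clones the estimator like this
```

Because nothing is read from module-level names, a cloned copy carries the same configuration as the original, which is what a grid search needs.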
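Your transformer should be able to transform any number of given samples.
The idea of the transform function is to transform any given sample or set of samples. In practice, you will probably persist your transformer and later use it to transform any number of new samples. You should use the fit function to take in whatever input is needed and fit the transformer accordingly. The idea is that once you have fitted the whole pipeline, you should be able to give it a single sample and get the output for that sample from your pipeline.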
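One way to meet that requirement, sketched here under the assumption that the full feature matrix (all columns) is passed into the pipeline, is to let the transformer only select columns from whatever X it receives. The class name ColumnChooser, its index parameters, and the synthetic data are all hypothetical and not taken from the question.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnChooser(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: keeps base columns plus optional extras."""

    def __init__(self, base_idx=(0, 1), error_idx=(2,), worst_idx=(3, 4),
                 add_error_feat=True, add_worst_feat=True):
        self.base_idx = base_idx
        self.error_idx = error_idx
        self.worst_idx = worst_idx
        self.add_error_feat = add_error_feat
        self.add_worst_feat = add_worst_feat

    def fit(self, X, y=None):
        # Decide once, at fit time, which column positions to keep.
        keep = list(self.base_idx)
        if self.add_error_feat:
            keep += list(self.error_idx)
        if self.add_worst_feat:
            keep += list(self.worst_idx)
        self.keep_idx_ = keep
        return self

    def transform(self, X):
        # Pure column selection: the output has exactly as many rows as X.
        return np.asarray(X)[:, self.keep_idx_]


rng = np.random.default_rng(0)
X_all = rng.normal(size=(569, 5))      # synthetic stand-in for the full matrix
chooser = ColumnChooser(add_worst_feat=False).fit(X_all[:379])
out_fold = chooser.transform(X_all[:379])   # a cross-validation training fold
out_one = chooser.transform(X_all[:1])      # a single new sample
print(out_fold.shape, out_one.shape)
```

Since no externally stored 569-row frame is concatenated, the row count of the output always follows the input, whether that is a training fold or one sample.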
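GridSearchCV performs 3-fold cross-validation by default (in the scikit-learn version used here; releases from 0.22 on default to 5-fold).
As the documentation states:
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use the default 3-fold cross validation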
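This means that it uses 2/3 of the input data to fit the pipeline at each stage. If you check the output, you will see that the code complains that the new data has 379 rows while the old data has 569 rows. 379/569 = 0.666080844. That is where the changed number of rows comes from.
Thanks for your comments on my code, and thank you very much for taking the time! Based on your input, I rewrote some steps in the fit method to keep track of the row indices. The pointer about global variables was good too. Now everything runs smoothly when testing the classifier!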
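For reference, the row counts mentioned in the answer can be reproduced with a small check. This sketch uses a plain KFold with three splits on a dummy array of 569 rows; for a classifier, GridSearchCV would actually use a stratified splitter, so KFold is a simplifying assumption that only illustrates the fold sizes.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((569, 10))   # dummy data; only the row count matters

train_sizes = []
test_sizes = []
for train_idx, test_idx in KFold(n_splits=3).split(X_demo):
    train_sizes.append(len(train_idx))
    test_sizes.append(len(test_idx))

print(train_sizes)   # rows seen by fit/transform in each split
print(test_sizes)    # rows held out for scoring in each split
```

Each split hands roughly two thirds of the rows to the pipeline, so a transformer that concatenates a stored 569-row frame onto its input cannot line the rows up.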