Python: Use a metric after a classifier in a Pipeline (python, machine-learning, scikit-learn, pipeline, grid-search)

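I am continuing my investigation into pipelines. My goal is to perform every step of machine learning using only a pipeline. That would be more flexible and would make it easier to adapt my pipeline to other use cases. So here is what I do:

  • Step 1: fill NaN values
  • Step 2: convert categorical values to numbers
  • Step 3: classifier
  • Step 4: grid search
  • Step 5: add metrics (this fails)
Here is my code: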

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score


class FillNa(BaseEstimator, TransformerMixin):

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        for column in x.columns:
            if column in non_numerics_columns:
                # Fill with the most frequent value of the column being
                # transformed (x), not of the global df.
                x.loc[:, column] = x.loc[:, column].fillna(
                    x[column].value_counts().idxmax())
            else:
                x.loc[:, column] = x.loc[:, column].fillna(
                    x.loc[:, column].mean())
        return x

    def fit(self, x, y=None):
        return self


class CategoricalToNumerical(BaseEstimator, TransformerMixin):

    def transform(self, x, y=None):
        non_numerics_columns = x.columns.difference(
            x._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            x.loc[:, column] = x.loc[:, column].fillna(
                x.loc[:, column].value_counts().idxmax())
            le.fit(x.loc[:, column])
            x.loc[:, column] = le.transform(x.loc[:, column]).astype(int)
        return x

    def fit(self, x, y=None):
        return self


class Perf(BaseEstimator, TransformerMixin):

    def fit(self, clf, x, y, perf="all"):
        """Only for classifier models.

        Plot the ROC curve and return the AUC, confusion matrix and F1 score
        computed from a fitted classifier and a DataFrame.
        `perf` is a comma-separated string naming what to compute.
        Example: perf='auc,roc,cm,f1' (or perf='all') selects all four.
        """
        evals = {}
        y_pred_proba = clf.predict_proba(x)[:, 1]
        y_pred = clf.predict(x)
        perf_list = perf.split(',')
        if "all" in perf_list or "roc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            roc_auc = round(auc(fpr, tpr), 3)
            plt.style.use('bmh')
            plt.figure(figsize=(12, 9))
            plt.title('ROC Curve')
            plt.plot(fpr, tpr, 'b',
                     label='AUC = {}'.format(roc_auc))
            plt.legend(loc='lower right', borderpad=1, labelspacing=1,
                       prop={"size": 12}, facecolor='white')
            plt.plot([0, 1], [0, 1], 'r--')
            plt.xlim([-0.1, 1.])
            plt.ylim([-0.1, 1.])
            plt.ylabel('True Positive Rate')
            plt.xlabel('False Positive Rate')
            plt.show()

        if "all" in perf_list or "auc" in perf_list:
            fpr, tpr, _ = roc_curve(y, y_pred_proba)
            evals['auc'] = auc(fpr, tpr)

        if "all" in perf_list or "cm" in perf_list:
            evals['cm'] = confusion_matrix(y, y_pred)

        if "all" in perf_list or "f1" in perf_list:
            evals['f1'] = f1_score(y, y_pred)

        return evals


path = '~/proj/akd-doc/notebooks/data/'
df = pd.read_csv(path + 'titanic_tuto.csv', sep=';')
y = df.pop('Survival-Status').replace(to_replace=['dead', 'alive'],
                                      value=[0., 1.])
X = df.copy()
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), test_size=0.2, random_state=42)

percent = 0.50
nb_features = round(percent * df.shape[1]) + 1
clf = RandomForestClassifier()
pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf),
                     ('perf', Perf())])

params = dict(random_forest__max_depth=list(range(8, 12)),
              random_forest__n_estimators=list(range(30, 110, 10)))
cv = GridSearchCV(pipeline, param_grid=params)
cv.fit(X_train, y_train)
I know that plotting the ROC curve here is not ideal, but that is not the problem right now.

So, when I run this code, I get:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('fillna', FillNa()), ('categorical_to_numerical', CategoricalToNumerical()), ('features_selection', SelectKBest(k=10, score_func=<function f_classif at 0x7f4ed4c3eae8>)), ('random_forest', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None,...=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)), ('perf', Perf())]) does not.

I am interested in any ideas.

As the error states, you need to specify the scoring parameter in GridSearchCV.

Use

GridSearchCV(pipeline, param_grid=params, scoring='accuracy')
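A minimal sketch of that call, assuming a synthetic dataset from `make_classification` and a shortened pipeline (`StandardScaler` plus the random forest) in place of the question's data and custom transformers. The variable name `search` and the small grid are illustrative choices. The final step here is a classifier, so the pipeline can predict and the `'accuracy'` scorer can be applied to each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic data (assumption, not the original dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# The last step is a classifier, so the pipeline exposes predict().
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('random_forest', RandomForestClassifier(random_state=42)),
])

# Small grid to keep the example quick.
params = {
    'random_forest__max_depth': [3, 5],
    'random_forest__n_estimators': [10, 20],
}

# An explicit scorer means GridSearchCV no longer needs pipeline.score().
search = GridSearchCV(pipeline, param_grid=params, scoring='accuracy', cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that passing `scoring` alone does not make a pipeline ending in `Perf` usable, because the scorer still needs the final step to predict.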

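Edit (based on the question in the comments):

If you need the ROC curve, AUC and F1 for the whole of X_train and y_train (rather than for every split made by GridSearchCV), it is better to keep the Perf class out of the pipeline.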

pipeline = Pipeline([('fillna', FillNa()),
                     ('categorical_to_numerical', CategoricalToNumerical()),
                     ('features_selection', SelectKBest(k=nb_features)),
                     ('random_forest', clf)])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

performance_meas = Perf()
performance_meas.fit(pipeline, X_train, y_train)
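The same idea can be sketched with a plain function instead of a transformer-style class. This is only an illustration under assumptions: `evaluate` is a hypothetical helper, the data is synthetic, `StandardScaler` stands in for the custom preprocessing steps, and the metrics are computed on the held-out split rather than on the training data used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, confusion_matrix, f1_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate(clf, X, y):
    """Hypothetical helper: metrics for an already fitted binary classifier."""
    y_pred = clf.predict(X)
    y_proba = clf.predict_proba(X)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y, y_proba)
    return {
        'auc': auc(fpr, tpr),
        'cm': confusion_matrix(y, y_pred),
        'f1': f1_score(y, y_pred),
    }


# Synthetic stand-in for the question's dataset (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('random_forest', RandomForestClassifier(n_estimators=20, random_state=42)),
])
pipeline.fit(X_train, y_train)

# The metrics live outside the pipeline and take the fitted pipeline as input.
scores = evaluate(pipeline, X_test, y_test)
print(scores)
```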

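Great. But is it not possible to plot the ROC curve this way?! And would it be possible to get the accuracy and the F1 score in the same pipeline?

Yes, it is possible. Did you not get the results? After checking your code further, it looks like it will raise another error even once this one is solved.

If I remove my class Perf and call cv = GridSearchCV(pipeline, param_grid=params, scoring='accuracy') followed by cv.fit(X_train, y_train), I do not get any error. I am trying to find a way to get the ROC, AUC and f1_score in the same run, and I do not understand how.

You can get any scoring metric (F1, accuracy, recall), but the question is what you want to use in GridSearchCV. When Perf is used inside the pipeline together with GridSearchCV, it means you want the scores for every split GridSearchCV performs on the data. If you want those scores for the whole data set, it is better to keep Perf out of the pipeline. Do you see what I mean?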
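If per-split values for several metrics are what is wanted, GridSearchCV can also take more than one scorer. The sketch below assumes scikit-learn 0.19 or later (where multi-metric `scoring` and `refit` by metric name are available), synthetic data, and a simplified pipeline; it is one possible approach, not something proposed in the thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('random_forest', RandomForestClassifier(random_state=1)),
])

params = {
    'random_forest__max_depth': [3, 5],
    'random_forest__n_estimators': [10, 20],
}

# Several built-in scorers at once; refit names the one used to pick the best model.
scoring = ['accuracy', 'f1', 'roc_auc']
search = GridSearchCV(pipeline, param_grid=params, scoring=scoring,
                      refit='f1', cv=3)
search.fit(X, y)

# One mean cross-validated score per parameter combination and per metric.
for name in scoring:
    print(name, search.cv_results_['mean_test_' + name].round(3))
print(search.best_params_)
```

These are averages over the cross-validation splits, so they answer a different question from scores computed once on the whole training or test set.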