Python: How to use RandomizedSearchCV to optimize F1 score and prediction speed?

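I am working on a model that will run in real time on end users' machines, so the model's prediction speed is critical.

I already have a RandomizedSearchCV that optimizes the F1 score.

What is missing is a way to factor prediction speed into the decision about which model is best.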


 from scipy import stats
 from sklearn.model_selection import RandomizedSearchCV
 from sklearn.svm import SVC

 model = SVC()
 # stats.uniform(loc, scale) samples from [loc, loc + scale]
 rand_list = {"C": stats.uniform(0.1, 10000),
              "kernel": ["rbf", "poly"],
              "gamma": stats.uniform(0.01, 100)}

 rand_search = RandomizedSearchCV(model, param_distributions = rand_list,
                                  n_iter = 20, n_jobs = 5, cv = 5,
                                  scoring = "f1", refit=True)

 rand_search.fit(X_tr_val, y_tr_val)  #todo: adjust
 print("Validation score of best model: ", rand_search.best_score_)
 print("Best parameters: ", rand_search.best_params_)

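I would like the randomized search to run a prediction for each parameter combination to measure its speed, and then assign a score based on a combination of F1 and speed.

Pseudocode: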

def scoringFunc:
     score = f1 + SpeedOfThePrediction
     return score

rand_search = RandomizedSearchCV(model, param_distributions = rand_list, 
                                 n_iter = 200, n_jobs = 5, cv = 5, 
                                 scoring = scoringFunc, refit=True) 

Does anyone know how I can use prediction speed in the scoring of RandomizedSearchCV?

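There are two reasons why this idea is difficult to implement:

  • The F1 score lies in the range [0, 1], while what you call prediction speed will span a much larger range. Simply summing the two would therefore drown out the influence of the F1 score.

  • A metric function wrapped with make_scorer for RandomizedSearchCV only receives (y_true, y_pred) as input arguments, so you cannot measure computation time / prediction speed inside such a metric function.

From the scikit-learn documentation, an example of custom scoring functions: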

    >>> from sklearn import datasets
    >>> from sklearn.model_selection import cross_validate
    >>> from sklearn.metrics import confusion_matrix, make_scorer
    >>> from sklearn.svm import LinearSVC
    >>> # A sample toy binary classification dataset
    >>> X, y = datasets.make_classification(n_classes=2, random_state=0)
    >>> svm = LinearSVC(random_state=0)
    >>> def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
    >>> def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
    >>> def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
    >>> def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]
    >>> scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
    ...            'fp': make_scorer(fp), 'fn': make_scorer(fn)}
    >>> cv_results = cross_validate(svm.fit(X, y), X, y,
    ...                             scoring=scoring, cv=5)
    >>> # Getting the test set true positive scores
    >>> print(cv_results['test_tp'])  
    [10  9  8  7  8]
    >>> # Getting the test set false negative scores
    >>> print(cv_results['test_fn'])  
    [0 1 2 3 2]
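
Regarding the first point about mismatched scales, one possible workaround is to map the latency into [0, 1] before blending it with F1. The sketch below is only an illustration: the function name `combine_f1_and_latency`, the 5 ms budget, and the 0.8 weight are assumptions, not values from the question.

```python
def combine_f1_and_latency(f1, latency_ms, budget_ms=5.0, f1_weight=0.8):
    """Blend F1 with a latency score so that both terms lie in [0, 1].

    budget_ms and f1_weight are hypothetical tuning knobs.
    """
    if not 0.0 <= f1_weight <= 1.0:
        raise ValueError("f1_weight must be in [0, 1]")
    if budget_ms <= 0:
        raise ValueError("budget_ms must be positive")
    # 1.0 for an instantaneous prediction, 0.0 at or beyond the budget
    latency_score = 1.0 - min(max(latency_ms, 0.0) / budget_ms, 1.0)
    return f1_weight * f1 + (1.0 - f1_weight) * latency_score
```

With this shape, a model can never gain more than `1 - f1_weight` from being fast, so the F1 term keeps a bounded, predictable share of the total.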
    

I came up with a solution:

    import time
    from sklearn.metrics import f1_score

    def f1SpeedScore(clf, X_val, y_true):
        # Time a single predict() call on the validation fold
        time_bef_pred = time.perf_counter()
        y_pred = clf.predict(X_val)
        time_aft_pred = time.perf_counter()
        pred_speed = time_aft_pred - time_bef_pred
        n = len(y_true)
        speed_one_sample = pred_speed / n  # seconds per sample

        speed_penalty = (speed_one_sample * 1000) * 0.01  # 0.01 score penalty per millisecond
        f1 = f1_score(y_true, y_pred)

        score = f1 - speed_penalty

        return score


    # A callable with the signature (estimator, X, y) can be passed directly as scoring
    rand_search = RandomizedSearchCV(model, param_distributions = rand_list,
                                     n_iter = iterations, n_jobs = threads, cv = splits,
                                     scoring = f1SpeedScore, refit=True, verbose = verbose)
    
    

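This slows things down a bit, because an extra prediction pass is needed. However, since you are only interested in an approximate speed, you can run the timed prediction on a small part of the dataset to keep the overhead low.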
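
The subset idea could look roughly like the sketch below. It assumes the validation fold is a NumPy array (or anything that supports row slicing and `.shape`); the factory name `make_f1_speed_scorer` and its parameters `n_timing_samples`, `penalty_per_ms`, and `repeats` are hypothetical, chosen only for illustration.

```python
import time

from sklearn.metrics import f1_score


def make_f1_speed_scorer(n_timing_samples=50, penalty_per_ms=0.01, repeats=3):
    """Build a scorer with the (estimator, X, y) signature.

    All three parameters are illustrative assumptions.
    """
    def scorer(clf, X_val, y_true):
        # F1 is computed on the full validation fold
        f1 = f1_score(y_true, clf.predict(X_val))

        # Timing uses only the first few rows to keep the overhead small
        X_small = X_val[:n_timing_samples]
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            clf.predict(X_small)
            timings.append(time.perf_counter() - start)

        # The fastest repeat is the least disturbed by other processes
        ms_per_sample = min(timings) / X_small.shape[0] * 1000.0
        return f1 - penalty_per_ms * ms_per_sample

    return scorer
```

It would be passed as `scoring=make_f1_speed_scorer()`. Note that batch timing divided by the batch size may understate the cost of predicting one sample at a time, which is what a real-time setting usually does.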

I actually found a solution (the scorer above), although I am not sure how well it works on other problems.
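
A different route, sketched here under stated assumptions, is to keep scoring="f1" and let the search pick the winner from the timings it already records: cv_results_ contains a mean_score_time entry per candidate, and refit accepts a callable that receives cv_results_ and returns the index of the chosen candidate. The names `PENALTY_PER_MS`, `select_by_f1_and_speed`, and `run_search`, the toy dataset, and the parameter ranges are all hypothetical. Also note that mean_score_time is the time in seconds to score a whole validation fold, not a single sample, and that it can be noisy when several jobs run in parallel.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

PENALTY_PER_MS = 0.01  # assumed trade-off: F1 points lost per ms of fold scoring time


def select_by_f1_and_speed(cv_results):
    """Return the index of the candidate with the best penalized F1."""
    f1 = np.asarray(cv_results["mean_test_score"], dtype=float)
    score_ms = np.asarray(cv_results["mean_score_time"], dtype=float) * 1000.0
    combined = f1 - PENALTY_PER_MS * score_ms
    # Candidates whose fit failed have NaN scores and are skipped
    return int(np.nanargmax(combined))


def run_search():
    # Toy data and small ranges, only to keep the sketch fast
    X, y = make_classification(n_samples=300, random_state=0)
    params = {"C": stats.uniform(0.1, 100),
              "kernel": ["rbf", "poly"],
              "gamma": stats.uniform(0.01, 1)}
    search = RandomizedSearchCV(SVC(), param_distributions=params,
                                n_iter=5, cv=3, scoring="f1",
                                refit=select_by_f1_and_speed, random_state=0)
    search.fit(X, y)
    return search


if __name__ == "__main__":
    search = run_search()
    print("Chosen parameters:", search.best_params_)
```

As far as I know, best_score_ is not set when refit is a callable, so the chosen candidate's F1 has to be read from cv_results_["mean_test_score"][search.best_index_].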